RHO-ESTIMATORS FOR SHAPE RESTRICTED DENSITY ESTIMATION

Y. BARAUD AND L. BIRGÉ


Abstract. The purpose of this paper is to pursue our study of ρ-estimators built from
i.i.d. observations that we defined in Baraud et al. (2014). For a ρ-estimator based on
some model $\overline S$ (which means that the estimator belongs to $\overline S$) and a true distribution of the
observations that also belongs to $\overline S$, the risk (with squared Hellinger loss) is bounded by a
quantity which can be viewed as a dimension function of the model and is often related to
the “metric dimension” of this model, as defined in Birgé (2006). This is a minimax point
of view and it is well-known that it is pessimistic. Typically, the bound is accurate for
most points in the model but may be very pessimistic when the true distribution belongs
to some specific part of it. This is the situation that we want to investigate here. For
some models, like the set of decreasing densities on [0,1], there exist specific points in the
model that we shall call extremal and for which the risk is substantially smaller than the
typical risk. Moreover, the risk at a non-extremal point of the model can be bounded by
the sum of the risk bound at a well-chosen extremal point plus the square of its distance
to this point. This implies that if the true density is close enough to an extremal point,
the risk at this point may be smaller than the minimax risk on the model and this actually
remains true even if the true density does not belong to the model. The result is based
on some refined bounds on the suprema of empirical processes that are established in
Baraud (2016).


1. Introduction 

The present paper pursues the study of ρ-estimation, introduced in Baraud et al. (2014),
as a versatile estimation strategy based on models. We want here to explain some specific
property of these estimators that we shall call superminimaxity, a study which was motivated
by a talk that Adityanand Guntuboyina gave in Cambridge in June 2014. His
talk was about Gaussian regression but we shall deal here with density estimation. Given
$n$ i.i.d. observations $X_1,\dots,X_n$ with an unknown density $s$ with respect to some reference
measure $\mu$ and an estimator $\hat s(X_1,\dots,X_n)$ of $s$, we measure its performance using the loss
function $h^2(s,\hat s)$ where $h$ is the Hellinger distance. We shall focus here on ρ-estimators and
some of their properties that lead to superminimaxity.

The first of these properties is robustness. There exist various notions of robustness:
robustness to model contamination, robustness to possible outliers, etc.; see Huber (1981)
for some illustrations. In some of these cases, the problem can be formulated in the
following way. If we know the performance of an estimator when the true density $s=\bar s$
belongs to a model $\overline S$, how does it deteriorate when $s$ is actually of the form $(1-\varepsilon)\bar s+\varepsilon t$
for some small $\varepsilon\in(0,1)$ and an arbitrary density $t\ne\bar s$, that is, when a proportion $\varepsilon$ of
the data actually corresponds to a sample of density $t$ and not $\bar s$? Since for such a density
$s$ one can check that $h^2(s,\bar s)\le\varepsilon$, it is natural to wonder what happens to the risk of the
estimator not only when $s$ is a mixture of the form $(1-\varepsilon)\bar s+\varepsilon t$ as before but also, more
generally, when it belongs to a small Hellinger ball around $\bar s$, which leads to the notion of
robustness with respect to Hellinger deviations that we shall use here.
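
The bound on the squared Hellinger distance between the contaminated density and the model density can be checked in one line. The sketch below writes $\bar s$ for the density in the model, $\varepsilon$ for the contamination proportion and $s=(1-\varepsilon)\bar s+\varepsilon t$ for the mixture. It assumes the usual normalization $h^2(p,q)=\frac12\int(\sqrt p-\sqrt q)^2\,d\mu=1-\int\sqrt{pq}\,d\mu$ for densities $p,q$.

```latex
% Pointwise, s = (1-\varepsilon)\bar s + \varepsilon t \ge (1-\varepsilon)\bar s, hence
h^2(s,\bar s) \;=\; 1-\int\sqrt{s\,\bar s}\,d\mu
  \;\le\; 1-\sqrt{1-\varepsilon}\int\bar s\,d\mu
  \;=\; 1-\sqrt{1-\varepsilon}
  \;\le\; \varepsilon ,
% the last step because \sqrt{1-\varepsilon}\ge 1-\varepsilon for \varepsilon\in[0,1].
```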

To illustrate the problem of contamination, assume that we choose as our statistical
model $\overline S$ for the unknown density $s$ the set of uniform densities on $[0,\theta]$ with $0<\theta<10$, in
which case the MLE (maximum likelihood estimator) is the uniform density $\hat s$ on $[0,X_{(n)}]$,
where $X_{(n)}$ is the largest observation, with a risk $\mathbb E_s[h^2(s,\hat s)]$ bounded by $C/n$ when $s$
belongs to $\overline S$, $\mathbb E_t$ denoting the expectation when the true density is $t$. But what if the true
density $s$ does not belong to $\overline S$? Unfortunately, the situation may become quite different.
If $s$ is the mixture $s_0=(1-1/n)\mathbb 1_{[0,1]}+(1/n)\mathbb 1_{[9,10]}$ for some $n$ larger than 100, it is easy
to check that, with probability of order $1-e^{-1}\approx0.63$, at least one of the $X_i$ will be larger
than 9 and the MLE will be the uniform distribution on $[0,X_{(n)}]$ with $X_{(n)}>9$, which
is a terrible estimator of $s=s_0$, although the model $\overline S$ is quite good since the Hellinger
distance between $s$ and $\overline S$ is not larger than $1/\sqrt n$.
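
The two numerical claims of this example are easy to reproduce. The following sketch is illustrative only: the function names are ours, and it assumes the normalization $h^2(p,q)=1-\int\sqrt{pq}$ for densities. It computes the probability that at least one observation drawn from the mixture falls in [9,10] and the squared Hellinger distance between the mixture and the uniform density on [0,1].

```python
import math


def prob_some_observation_above_nine(n: int) -> float:
    """P(max X_i > 9) for n i.i.d. draws from (1 - 1/n) U[0,1] + (1/n) U[9,10]."""
    # Each draw falls in [9, 10] with probability 1/n, independently of the others.
    return 1.0 - (1.0 - 1.0 / n) ** n


def squared_hellinger_to_uniform01(n: int) -> float:
    """h^2 between the mixture and U[0,1], using h^2(p, q) = 1 - integral of sqrt(p * q)."""
    # On [0, 1] the mixture equals 1 - 1/n and the uniform density equals 1;
    # outside [0, 1] the product of the two densities vanishes.
    return 1.0 - math.sqrt(1.0 - 1.0 / n)


if __name__ == "__main__":
    for n in (101, 1000, 10000):
        print(n, prob_some_observation_above_nine(n),
              squared_hellinger_to_uniform01(n), 1.0 / n)
```

The squared distance stays below $1/n$, so the distance itself is at most $1/\sqrt n$, while the probability of a misleading largest observation stays close to $1-e^{-1}$.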

The previous example shows that the MLE is definitely not robust in our sense since it
may be very sensitive to small deviations from the model, contrary to ρ-estimators.
To be more precise, let us consider some model of densities $\overline S$ and a ρ-estimator $\hat s$ based on
$\overline S$ with a risk function on $\overline S$ bounded by $R(\bar s,n)$, that is,

(1)    $\mathbb E_{\bar s}\left[h^2(\bar s,\hat s)\right]\le R(\bar s,n)$ for all $\bar s\in\overline S$.

The robustness of ρ-estimators can be expressed by the following property, proven in
Baraud et al. (2014): whatever the density $s$,

(2)    $\mathbb E_s\left[h^2(s,\hat s)\right]\le C_0\left[R(\bar s,n)+h^2(s,\bar s)\right]$ for all $\bar s\in\overline S$,

where $C_0$ is a universal positive constant. This is a fundamental property of ρ-estimators
for the following reasons: if $s$ is quite close to a simple density $\bar s$ in $\overline S$ which can be estimated
with a small risk bound $R(\bar s,n)$, the ρ-estimator will essentially behave as if the true density
were $\bar s$ and the risk bound at $s$ will be that at $\bar s$ plus a small additional term that can be
viewed as a squared bias. Intuitively, a ρ-estimator based on a sample with density $s$ and a
ρ-estimator based on a sample with density $\bar s$ will remain close. In our parametric example
based on uniform distributions, everything happens as if the ρ-estimator only considered
the data with values in [0,1] and ignored the others. Consequently, its risk remains of order
$1/n$ even when $s=s_0$ instead of $s=\bar s=\mathbb 1_{[0,1]}$. This notion of robustness is quite flexible
and shows that the risk of the estimator does not deteriorate much in a small Hellinger
neighbourhood of any point $\bar s$ of the model.

For many well-chosen models $\overline S$, the risk can be uniformly bounded on $\overline S$:

(3)    $\sup_{\bar s\in\overline S}R(\bar s,n)\le R(\overline S,n)$,

which corresponds to the minimax point of view, so that (2) and (3) lead to

(4)    $\mathbb E_s\left[h^2(s,\hat s)\right]\le C_0\left[R(\overline S,n)+h^2(s,\overline S)\right]$ whatever the density $s$.

It turns out that for some models $\overline S$, there exists a subset $V$ of $\overline S$ such that the risk bounds
$R(\bar s,n)$ are substantially smaller than $R(\overline S,n)$ for all $\bar s\in V$. This is what we call
superminimaxity. Although there exists some analogy in the denomination, the notion is
quite different from the one of superefficiency as described in the famous counterexample of
Hodges and the theorem of Le Cam about points of superefficiency, apart from the fact that
it deals with the property that estimation is faster at some points. However, superefficiency
is an asymptotic property at a point, while superminimaxity on $V$ is definitely nonasymptotic
and defined for a given value of the number $n$ of observations. For a detailed study
of superefficiency, one could look at the paper by Brown, Low, and Zhao (1997). Moreover,
superminimaxity on $V$ together with the robustness of ρ-estimators has the following
consequence: if $s$ is either in $V$ or close enough to it, the risk bound at $s$,

$\mathbb E_s\left[h^2(s,\hat s)\right]\le C_0\inf_{\bar s\in V}\left[R(\bar s,n)+h^2(s,\bar s)\right]$,

may be substantially smaller than the typical risk bound (4) leading to superminimaxity at
$s$. It is actually the combination of the robustness of ρ-estimators and the existence of local
risk bounds of the form (1) that leads to this phenomenon as also described, in a different
framework, by Chatterjee et al. (2015), a paper that strongly influenced our research in
this direction.

Showing that the risk $\mathbb E_{\bar s}\left[h^2(\bar s,\hat s)\right]$ at some particular points $\bar s$ can be bounded from
above by some quantity $R(\bar s,n)$ which is of smaller order than the global minimax risk over
$\overline S$ requires some specific probabilistic tools that have been established in Baraud (2016).
These tools allow one to bound the expectation of the supremum of an empirical process over
the neighbourhood of an element $\bar s\in\overline S$ by some quantity which is of smaller order than
what one could get by using the global entropy of the class $\overline S$ as, for example, in van de
Geer (1993).

The existence of points in the model on which the estimator is superminimax was already
noticed for the Grenander estimator of a non-increasing density (see Grenander (1981)
and Groeneboom (1985)) on an interval $[a,+\infty)$ with a known value of $a$. It is shown
in Birgé (1989) that the $L_1$-risk of the Grenander estimator of a non-increasing piecewise
constant density based on at most $D$ intervals is bounded by $c\sqrt{D/n}$, for some positive
universal constant $c$, and can therefore be of smaller order than the typical risk for
non-increasing densities which is of order $n^{-1/3}$. We shall see below that, for the same estimation
problem, the ρ-estimator will perform similarly (up to possible logarithmic factors) with the
same superminimaxity property on piecewise constant densities. Moreover, the ρ-estimator
does not need to know $a$ and is robust with respect to the Hellinger distance.

The case of monotone densities on $[a,+\infty)$ is far from unique. There are many other
examples of families $\overline S$ of densities for which one can find a subset $V$ of $\overline S$ on which the
rates of convergence of the ρ-estimator are faster than the rate at a “typical” point of $\overline S$.
Moreover, it happens that the set $V$ often possesses good approximation properties with
respect to the much larger space $\overline S$. These approximation properties combined with the
robustness of ρ-estimators as expressed by (2) allow us to derive non-asymptotic minimax risk
bounds over large subsets of $\overline S$. Such sets are possibly non-compact and therefore possess
neither a finite metric dimension nor a finite entropy.

In view of illustrating this superminimaxity phenomenon, we shall consider in the present
paper models of densities $\overline S$ defined by some shape constraints, namely piecewise monotone,
piecewise convex or concave and log-concave densities. There is a large amount of literature
dealing with these density models and we shall content ourselves with mentioning a few references
only and refer the reader to the bibliography therein. For monotone densities we refer to
the books by Groeneboom and Wellner (1992) and van de Geer (2000). For the estimation
of a convex density, we mention Groeneboom et al. (2001) and refer to the papers of Doss
and Wellner (2015), Dümbgen and Rufibach (2009), Cule and Samworth (2010) for the
estimation of a log-concave density. In the regression setting, let us mention Guntuboyina
and Sen (2015) for estimating a convex regression function and Chatterjee et al. (2015)
for the isotonic regression. Recently, Bellec (2015) extended the results of Chatterjee et
al. (2015) about the properties of least-squares estimators over convex polyhedral cones
in the homoscedastic Gaussian regression framework, to general closed convex subsets of
$\mathbb R^n$ from which he also derived some results of superminimaxity in this specific Gaussian
framework. In these two papers the results are restricted to convex models. By contrast,
convexity does not play any special role in our presentation and the models we shall use
here are not necessarily convex, which allows us to deal with more general shape constraints
like piecewise monotonicity or log-concavity.

The paper is organised as follows. The statistical setting, main notations and conventions
as well as a brief reminder of what a ρ-estimator is in the density estimation framework
can be found in Section 2. The introductory example of the model of monotone densities
in Section 3 gives a first flavour of the results we establish throughout the paper. The main
result can be found in Section 4 and its applications to different density models (piecewise
constant, piecewise monotone, piecewise convex-concave and log-concave densities) are
detailed in Section 5. The problem of model selection is addressed in Section 6 and Section 7
is devoted to the proofs.


2. The statistical setting 


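Let $(\mathscr X,\mathscr A)$ be a measurable space, $\mu$ a σ-finite measure on $(\mathscr X,\mathscr A)$ and $\mathscr P_\mu$ the set of all
probabilities on $(\mathscr X,\mathscr A)$ which are absolutely continuous with respect to $\mu$. We shall denote
by $\mathscr L$ the set of real-valued functions from $\mathscr X$ to $\mathbb R$ and $\mathscr L_\mu$ the subset of $\mathscr L$ consisting
of those functions $t\ge0$ satisfying $\int t\,d\mu=1$, that is the set of probability densities with
respect to $\mu$. An element of $\mathscr P_\mu$ with density $t\in\mathscr L_\mu$ will be denoted by $P_t$. We turn $\mathscr P_\mu$ into
a metric space via the Hellinger distance $h$. We recall from Le Cam (1973 or 1986) that
the Hellinger distance between two elements $P$ and $Q$ of $\mathscr P_\mu$ is given by

$h(P,Q)=\left[\frac12\int_{\mathscr X}\left(\sqrt{dP/d\mu}-\sqrt{dQ/d\mu}\right)^2d\mu\right]^{1/2}$.

For $P_t,P_u\in\mathscr P_\mu$ with $t,u\in\mathscr L_\mu$, we shall write $h(t,u)$ for $h(P_t,P_u)$.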
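
For densities on the real line, the squared Hellinger distance can be approximated numerically. The sketch below is only an illustration: it assumes the Lebesgue measure as reference measure, the normalization with the factor 1/2, and a plain midpoint rule on a bounded window that contains both supports. The helper names are ours.

```python
import math
from typing import Callable

Density = Callable[[float], float]


def squared_hellinger(p: Density, q: Density, lo: float, hi: float,
                      cells: int = 4000) -> float:
    """Midpoint-rule approximation of (1/2) * integral of (sqrt(p) - sqrt(q))^2 on [lo, hi]."""
    width = (hi - lo) / cells
    total = 0.0
    for i in range(cells):
        x = lo + (i + 0.5) * width  # midpoint of the i-th cell
        total += (math.sqrt(p(x)) - math.sqrt(q(x))) ** 2
    return 0.5 * total * width


def uniform_density(theta: float) -> Density:
    """Density of the uniform distribution on [0, theta]."""
    return lambda x: 1.0 / theta if 0.0 <= x <= theta else 0.0


if __name__ == "__main__":
    # For U[0,1] against U[0,4] the exact value is 1 - integral of sqrt(p * q) = 1 - 1/2.
    print(squared_hellinger(uniform_density(1.0), uniform_density(4.0), 0.0, 4.0))
```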

We observe $n$ i.i.d. random variables $X_1,\dots,X_n$ with values in $(\mathscr X,\mathscr A)$ and distribution
$P_s$ for some density $s\in\mathscr L_\mu$. Although $s$ might not be uniquely defined in $\mathscr L_\mu$ as a density
with respect to $\mu$ of the distribution of the observations, we shall refer to $s$ as “the” density
of $P_s$ for simplicity. To avoid trivialities, we shall always assume, in the sequel, that $n\ge3$,
so that $\log n>1$. The estimators $\hat s$ that we shall consider here will be based on models for
$s$ defined as follows.


Definition 1. A density model, or a model (for short), is a subset $\overline S$ of $\mathscr L_\mu$ for which there
exists an at most countable subset $S\subset\overline S$ such that $\{P_t,\ t\in S\}$ is dense in $\{P_t,\ t\in\overline S\}$
with respect to the Hellinger distance. We shall then say that $S$ is dense in $\overline S$ and that $\overline S$
is separable with respect to the Hellinger distance.


A density model $\overline S$ should be chosen so that the corresponding probability model $\{P_t,\ t\in\overline S\}$
approximates the true distribution $P_s$ (with respect to the Hellinger distance). The
model $\overline S$ may or may not contain $s$. Of course a model $\overline S$ is good only if the distance
$h(s,\overline S)$ is not too large, where we set, for $A\subset\mathscr L_\mu$, $h(t,A)=\inf_{u\in A}h(t,u)$. Our aim in
this paper is to study the performance of a ρ-estimator $\hat s$ of $s$ built on $\overline S$. The definition
and properties of ρ-estimators have been described in great detail in Baraud et al. (2014)
and we only give below a brief account of what a ρ-estimator is.


2.1. What is a ρ-estimator? In the context of density estimation based on i.i.d. variables,
which is the one we consider here, a ρ-estimator provides a robust (in our sense) and
(almost) rate optimal estimator over a model $\overline S$ of densities in all cases we know. In order
to avoid long developments we restrict ourselves to its construction in the specific situation
we shall encounter here, namely when the observations are i.i.d.

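Let $\psi$ be the increasing function from $[0,+\infty]$ onto $[-1,1]$ defined by

$\psi(u)=\frac{u-1}{\sqrt{1+u^2}}$ for $u\in[0,+\infty)$ and $\psi(+\infty)=1$.

Given a model $\overline S$ of densities on $(\mathscr X,\mathscr A,\mu)$ and a countable and dense subset $S$ of $\overline S$, a
ρ-estimator $\hat s$ of the density $s$ on $\overline S$ is defined in the following way. For densities $t,t'\in\mathscr L_\mu$
we set

$T(X,t,t')=\sum_{i=1}^{n}\psi\left(\sqrt{\frac{t'(X_i)}{t(X_i)}}\right)$

and define $\hat s$ as any (measurable) element of the closure of the set

(5)    $\left\{t\in S\ \middle|\ \Upsilon(S,t)<\inf_{t'\in S}\Upsilon(S,t')+35.7\right\}$ with $\Upsilon(S,t)=\sup_{t'\in S}T(X,t,t')$.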


In the calculation of $T(X,t,t')$, which involves the ratio $t'/t$, we use the convention $0/0=1$
and $a/0=+\infty$ for $a>0$. The constant 35.7 in (5) has only been chosen for convenience
in the calibration of the numerical constants in the original paper Baraud et al. (2014) and
can be replaced by any positive number. It is clear from the construction that, given a
model $\overline S$, there is not a unique ρ-estimator on $\overline S$. However, the risk bounds we derived in
Baraud et al. (2014) are valid for any version of these ρ-estimators.
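
To make the construction concrete, here is a minimal sketch of the selection rule over a finite family of candidate densities. It is an illustration under simplifying assumptions, not the construction of Baraud et al. (2014) itself: the countable dense subset is replaced by a finite list, an exact minimiser is returned instead of an arbitrary near-minimiser, and the statistic is assumed to be the sum over the observations of $\psi$ applied to the square root of the ratio $t'/t$. All function names are ours.

```python
import math
from typing import Callable, Sequence

Density = Callable[[float], float]


def psi(u: float) -> float:
    """psi(u) = (u - 1) / sqrt(1 + u^2) on [0, +inf), with psi(+inf) = 1."""
    if math.isinf(u):
        return 1.0
    return (u - 1.0) / math.sqrt(1.0 + u * u)


def ratio(num: float, den: float) -> float:
    """Ratio with the conventions 0/0 = 1 and a/0 = +inf for a > 0."""
    if den == 0.0:
        return 1.0 if num == 0.0 else math.inf
    return num / den


def t_stat(xs: Sequence[float], t: Density, t_prime: Density) -> float:
    """Assumed form of T(X, t, t'): sum over i of psi(sqrt(t'(X_i) / t(X_i)))."""
    return sum(psi(math.sqrt(ratio(t_prime(x), t(x)))) for x in xs)


def rho_select(xs: Sequence[float], candidates: Sequence[Density]) -> int:
    """Index of a candidate t minimising the supremum over t' of T(X, t, t')."""
    scores = [max(t_stat(xs, t, tp) for tp in candidates) for t in candidates]
    return min(range(len(candidates)), key=scores.__getitem__)


def uniform_density(theta: float) -> Density:
    """Density of the uniform distribution on [0, theta]."""
    return lambda x: 1.0 / theta if 0.0 <= x <= theta else 0.0


if __name__ == "__main__":
    # 99 points spread over (0, 1) and a single contaminating point at 9.5.
    xs = [(i + 0.5) / 99 for i in range(99)] + [9.5]
    family = [uniform_density(1.0), uniform_density(10.0)]
    print(rho_select(xs, family))
```

On this toy sample the rule keeps the uniform density on [0,1], whereas the likelihood of that density is zero because of the single point at 9.5, so maximum likelihood restricted to the same two candidates would pick the uniform density on [0,10].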


2.2. Notations, conventions and definitions. We set $\log_+(x)=\max\{\log x,1\}$, $\mathbb N^*=
\mathbb N\setminus\{0\}$, $a\vee b=\max(a,b)$, $a\wedge b=\min(a,b)$ and, for $x\in\mathbb R_+$, $\lceil x\rceil=\inf\{n\in\mathbb N,\ n\ge x\}$;
$|A|$ denotes the cardinality of the finite set $A$ and $C,C',\dots$ numerical constants that may
vary from line to line. For a function $f$ on $\mathbb R$, $f(x+)$ and $f(x-)$ denote respectively the
right-hand and left-hand limits of $f$ at $x$ whenever these limits exist. We shall also use the
following conventions: $\sum_{\varnothing}=0$, $x/0=+\infty$ if $x>0$, $x/0=-\infty$ if $x<0$ and $0/0=1$.

Definition 2. A partition of the open interval $(a,b)$ ($-\infty\le a<b\le+\infty$) of size $k+1$
with $k\in\mathbb N$ is either $\varnothing$ when $k=0$ or a finite set $\mathcal I=\{x_1,\dots,x_k\}$ of real numbers with
$a<x_1<x_2<\dots<x_k<b$ if $k\ge1$. We shall call endpoints of the partition $\mathcal I$ the numbers
$x_j$, $1\le j\le k$, and intervals of the partition the open intervals $I_j=(x_j,x_{j+1})$, $0\le j\le k$,
with $x_0=a$ and $x_{k+1}=b$. A partition $\mathcal I$ will also be identified with the set of its intervals
and we shall equally write $\mathcal I=\{I_0,\dots,I_k\}$ or $\mathcal I=\{x_1,\dots,x_k\}$.

The set of all partitions of $\mathbb R$ with $k$ endpoints or $k+1$ intervals is denoted by $\mathcal J(k+1)$
and the length of $I_j$ by $\ell(I_j)$. If $\mathcal I=\{x_1,\dots,x_k\}$ and $\mathcal I'=\{x'_1,\dots,x'_{k'}\}$, $\mathcal I\vee\mathcal I'=
\{x_1,\dots,x_k\}\cup\{x'_1,\dots,x'_{k'}\}$ and $\mathcal I\preceq\mathcal I'$ means that $\{x_1,\dots,x_k\}\supset\{x'_1,\dots,x'_{k'}\}$.


3. Monotone densities


In view of illustrating the main result of this paper to be presented in Section 4, let
us consider the example of the model $\overline S$ consisting of all the densities with respect to the
Lebesgue measure $\mu$ that are non-increasing on some arbitrary interval of $\mathbb R$ which is open
on its left end and vanish elsewhere. In this case $\mathscr X=\mathbb R$, endowed with its Borel σ-algebra,
and $\overline S$ is the set of all densities of the form $t=f\mathbb 1_{(x,+\infty)}$ with $x\in\mathbb R$ and $f$ a non-increasing and
non-negative function on $(x,+\infty)$ (which may be unbounded in the neighbourhood of $x$) such
that $\int_x^{+\infty}f(y)\,dy=1$. The results we get below would be similar for the set of all densities
which are non-decreasing on some interval of $\mathbb R$ and vanish elsewhere.

For $D\in\mathbb N^*$ we define $V(D)$ to be the set of all densities of the form $\sum_{j=1}^{D}a_j\mathbb 1_{I_j}$
with $\mathcal I=\{x_1,\dots,x_{D+1}\}\in\mathcal J(D+2)$ and $a_j\ge0$ for $1\le j\le D$. Note that the densities
in $V(D)$ take the value 0 on the two unbounded extremal intervals $I_0$ and $I_{D+1}$ of the
partition $\mathcal I$. For instance, $V(1)$ corresponds to the family of uniform densities on intervals,
that is

$V(1)=\left\{t(\cdot)=\theta_1^{-1}\mathbb 1_{(0,\theta_1)}(\cdot-\theta_0),\ \theta_1>0,\ \theta_0\in\mathbb R\right\}$.

In such a situation, we can prove the following result.


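Theorem 1. Any ρ-estimator $\hat s$ on $\overline S$, as defined in Section 2.1, satisfies

(6)    $C\,\mathbb E_s\left[h^2(s,\hat s)\right]\le\inf_{D\ge1}\left[h^2\left(s,V(D)\cap\overline S\right)+\frac Dn\log_+^3\left(\frac nD\right)\right]$ for all $s\in\mathscr L_\mu$,

and some universal constant $C\in(0,1]$.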


Remark. Since $C\le1$, the left-hand side is always bounded by one so that it is useless
to consider values of $D$ that lead to a bound which is not smaller than one, in particular
$D\ge n$, and (6) is actually equivalent to

$C\,\mathbb E_s\left[h^2(s,\hat s)\right]\le\inf_{1\le D<n}\left[h^2\left(s,V(D)\cap\overline S\right)+\frac Dn\log_+^3\left(\frac nD\right)\right]$.

Although we shall not repeat it systematically, the same remark will hold for all our
subsequent results.


Bound (6) means that the risk function $s\mapsto\mathbb E_s\left[h^2(s,\hat s)\right]$ of $\hat s$ over $\mathscr L_\mu$ can be quite small
in the neighbourhood of some specific densities $t\in\overline S$: if $s$ belongs to $V(D)\cap\overline S$ with $D<n$
or is close enough to some density $t\in V(D)\cap\overline S$, the risk of $\hat s$ is of order $D/n$, up to
logarithmic factors. More precisely,

$\sup_{s\in V(D)\cap\overline S}\mathbb E_s\left[h^2(s,\hat s)\right]\le C'\frac Dn\log_+^3\left(\frac nD\right)$ for $1\le D<n$.

When $n$ becomes large and $D$ remains fixed, the rate of convergence of $\hat s$ towards an element
of $V(D)\cap\overline S$ is therefore almost parametric.

Of particular interest are the densities $t$ which are bounded, supported on a compact
interval $[a,b]$ of $\mathbb R$ (for numbers $a<b$ depending on $t$) and non-increasing on $(a,b)$. Given
$M\ge0$, we introduce the set $S(M)$ of densities $t$ of this form and for which

(7)    $(b-a)V^2_{[a,b]}\left(\sqrt t\right)=M(t)\le M$,

where the variation $V_{[a,b]}\left(\sqrt t\right)$ of the non-increasing function $\sqrt t$ on $[a,b]$ is defined in the
following way:

Definition 3. Let the function $f$ be defined on some interval $I$ (with positive length) of $\mathbb R$
and monotone on the interior $\mathring I$ of $I$. Its variation on $I$ is given by

(8)    $V_I(f)=\sup_{x\in\mathring I}f(x)-\inf_{x\in\mathring I}f(x)\in[0,+\infty]$.


Note that $S(0)$ is the set of uniform densities on intervals, so that $S(0)=V(1)$, and that
$S(M)$ is not compact and contains densities that can be arbitrarily large in sup-norm. The
functional $M(\cdot)$ remains invariant by translation and scaling: if $u(\cdot)=\lambda t(\lambda(\cdot-\tau))$ with $\lambda>0$
and $\tau\in\mathbb R$, then $M(u)=M(t)$ which implies that $S(M)$ is also invariant by translation
and scaling. It turns out that the densities lying in $S(M)$ can be well approximated by
elements of $V(D)$. More precisely, the following approximation result holds.

Proposition 1. For all $D\in\mathbb N^*$ and $t\in\bigcup_{M\ge0}S(M)$,

$h^2\left(t,V(D)\cap\overline S\right)\le\left[M(t)/(2D)^2\right]\wedge1$.


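Using the triangle inequality, the right-hand side of (6) can be bounded from above in
the following way: for all $M\ge0$ and $t\in S(M)$,

\[
\begin{aligned}
\inf_{D\ge1}\left[h^2\left(s,V(D)\cap\overline S\right)+\frac Dn\log_+^3\left(\frac nD\right)\right]
&\le 2h^2(s,t)+\inf_{D\ge1}\left[2h^2\left(t,V(D)\cap\overline S\right)+\frac Dn\log_+^3\left(\frac nD\right)\right]\\
&\le 2h^2(s,t)+\inf_{D\ge1}\left[\frac{M}{2D^2}+\frac Dn\log_+^3\left(\frac nD\right)\right].
\end{aligned}
\]

Finally, since $t$ is arbitrary in $S(M)$,

\[
\inf_{D\ge1}\left[h^2\left(s,V(D)\cap\overline S\right)+\frac Dn\log_+^3\left(\frac nD\right)\right]
\le 2h^2\left(s,S(M)\right)+\inf_{D\ge1}\left[\frac{M}{2D^2}+\frac Dn\log_+^3\left(\frac nD\right)\right].
\]

Optimizing the right-hand side with respect to $D$ and using the facts that $M$ is arbitrary
and $\log n>1$, we derive the following corollary of Theorem 1.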
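
The trade-off between the approximation term and the complexity term can be explored numerically. The sketch below is an illustration with hypothetical function names: it evaluates the bracket $M/(2D^2)+(D/n)\log_+^3(n/D)$ by exhaustive search over $1\le D<n$, and also shows that the complexity term equals one at $D=n$, in line with the remark following Theorem 1.

```python
import math
from typing import Tuple


def log_plus(x: float) -> float:
    """log_+(x) = max(log x, 1)."""
    return max(math.log(x), 1.0)


def complexity_term(d: int, n: int) -> float:
    """(D / n) * log_+^3(n / D)."""
    return (d / n) * log_plus(n / d) ** 3


def best_tradeoff(m: float, n: int) -> Tuple[int, float]:
    """Minimise M / (2 D^2) + (D / n) log_+^3(n / D) over 1 <= D < n by exhaustive search."""
    best_d, best_value = 1, math.inf
    for d in range(1, n):
        value = m / (2.0 * d * d) + complexity_term(d, n)
        if value < best_value:
            best_d, best_value = d, value
    return best_d, best_value


if __name__ == "__main__":
    n = 1000
    for m in (0.0, 1.0, 100.0):
        print(m, best_tradeoff(m, n))
```

For $M=0$ the minimiser is $D=1$, which corresponds to the uniform densities; as $M$ grows, larger values of $D$ become preferable.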


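Corollary 1. For all probabilities $P_s$ in $\mathscr P_\mu$, any ρ-estimator $\hat s$ of $s$ on $\overline S$ satisfies, for some
constant $C\in(0,1]$,

(9)    $C\,\mathbb E_s\left[h^2(s,\hat s)\right]\le\inf_{M\ge0}\left[h^2\left(s,S(M)\right)+\left(M^{1/3}n^{-2/3}(\log n)^2\right)\vee\left(n^{-1}(\log n)^3\right)\right]$.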


In particular, if $s\in S(M)$ for some $M\ge n^{-1}(\log n)^3$, the risk bound of the estimator
is not larger (up to a universal constant) than $M^{1/3}n^{-2/3}(\log n)^2$ while for smaller values
of $M$ it is bounded by $n^{-1}(\log n)^3$. Up to logarithmic factors, this rate (with respect to
$n$) is optimal since it corresponds to the lower bound of order $n^{-2/3}$ for the minimax risk
on the subset of $S(M)$ consisting of the non-increasing densities supported in [0,1] and
bounded by $M$. This lower bound follows from the proof of Proposition 1 of Birgé (1987).
The result was actually stated in this paper for the $L_1$-distance but its proof shows that it
applies to the Hellinger distance as well. This property means that, although the set $S(M)$
is not compact because the support of the densities is unknown, the minimax risk on $S(M)$
is finite. We do not know any other estimator with the same performance which is also
robust with respect to Hellinger deviations.

Note that Corollary 1 can also be used to determine the rate of estimation for decreasing
densities $s$ with possibly unbounded support and maximum value, provided that we have
some assumption about the behaviour of the function $M\mapsto h\left(s,S(M)\right)$ when $M$ goes to
infinity.


4. The main result 
Let us start with some definitions. 

Definition 4. A class $\mathscr C$ of subsets of a set $\mathscr X$ is said to shatter a finite subset $A=\{x_1,\dots,x_m\}$
of $\mathscr X$ if the class of subsets

(10)    $\mathscr C\cap A=\{C\cap A,\ C\in\mathscr C\}$

is equal to the class of all subsets of $A$ or, equivalently, if $|\mathscr C\cap A|=2^m$. A non-empty
class $\mathscr C$ of subsets of $\mathscr X$ is a VC-class with dimension $d\in\mathbb N$ if there exists some integer $m$
such that no finite subset $A\subset\mathscr X$ with cardinality $m$ can be shattered by $\mathscr C$ and $d+1$ is the
smallest $m$ with this property.

Definition 5. Let $\mathscr F$ be a non-empty class of functions on a set $\mathscr X$ with values in $[-\infty,+\infty]$.
We shall say that $\mathscr F$ is weak VC-major with dimension $d\in\mathbb N$ if $d$ is the smallest integer
$k\in\mathbb N$ such that, for all $u\in\mathbb R$, the class

(11)    $\mathscr C_u(\mathscr F)=\left\{\{x\in\mathscr X,\ f(x)>u\},\ f\in\mathscr F\right\}$

is a VC-class of subsets of $\mathscr X$ with dimension not larger than $k$.

We may now introduce the main property to be used in this paper. 

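Definition 6. Let $\mathscr F$ be a class of real-valued functions on $\mathscr X$. We shall say that an element
$\bar f\in\mathscr F$ is extremal in $\mathscr F$ (or is an extremal point of $\mathscr F$) with degree $d(\bar f)=d\vee1\in\mathbb N^*$ if
the class of functions

$(\mathscr F/\bar f)=\left\{f/\bar f,\ f\in\mathscr F\right\}$

is weak VC-major with dimension $d$.

Proposition 2. Let $\mathscr F$ be a class of nonnegative functions on $\mathscr X$. The element $\bar f\in\mathscr F$ is
extremal in $\mathscr F$ with degree not larger than $2d$ if for all $\lambda\ge0$,

$\mathscr C(\mathscr F,\bar f,\lambda)=\{\mathscr X\}\cup\left\{\{x\in\mathscr X\,|\,f(x)-\lambda\bar f(x)>0\},\ f\in\mathscr F\right\}$

is a VC-class with dimension not larger than $d\ge1$.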

Proof. Let us bound the VC dimension of $\mathscr C_u((\mathscr F/\bar f))$ according to the value of $u\in\mathbb R$. If
$u<0$, $\mathscr C_u((\mathscr F/\bar f))=\{\mathscr X\}$ and is therefore VC with dimension not larger than $0<2d$. Let
us now assume that $u\ge0$ and set $A=\{\bar f>0\}$. Using Lemma 5 (in Section 7.1), it suffices
to prove that $\mathscr C_u((\mathscr F/\bar f))\cap A$ and $\mathscr C_u((\mathscr F/\bar f))\cap A^c$ are two VC-classes with dimensions not
larger than $d$. For all $x\in A$ and $f\in\mathscr F$,

$f(x)/\bar f(x)>u$ is equivalent to $f(x)>u\bar f(x)$,

showing thus that $\mathscr C_u((\mathscr F/\bar f))\cap A\subset\mathscr C(\mathscr F,\bar f,u)\cap A$ and is therefore VC with dimension
not larger than $d$. Let us now turn to the case where $x\notin A$, which means that $\bar f(x)=0$ so
that $f(x)/\bar f(x)$ is either 1 or $+\infty$ (with our conventions). For $u\ge1$ and all $x\in A^c$,

$f(x)/\bar f(x)>u$ is equivalent to $f(x)>0$

and $\mathscr C_u((\mathscr F/\bar f))\cap A^c\subset\mathscr C(\mathscr F,\bar f,0)\cap A^c$ and is therefore VC with dimension not larger than
$d$. For $u\in[0,1)$, $(f/\bar f)(x)>u$ for all $x\in A^c$, hence $\mathscr C_u((\mathscr F/\bar f))\cap A^c=\{A^c\}$ which is VC
with dimension $0<d$ and this concludes the proof. □


Let us now state our main result. 


Theorem 2. Let $\overline S$ be a model with a non-void set $A$ of extremal points. Any ρ-estimator
$\hat s$ on $\overline S$ satisfies, for some universal constant $C\in(0,1]$,

$\mathbb P_s\left[C\,h^2(s,\hat s)\le\inf_{\bar s\in A}\left\{h^2(s,\bar s)+\frac{d(\bar s)}n\log_+^3\left(\frac n{d(\bar s)}\right)\right\}+\frac\xi n\right]\ge1-e^{-\xi}$ for all $\xi>0$,

whatever the true distribution $P_s\in\mathscr P_\mu$. Consequently,

$C\,\mathbb E_s\left[h^2(s,\hat s)\right]\le\inf_{\bar s\in A}\left\{h^2(s,\bar s)+\frac{d(\bar s)}n\log_+^3\left(\frac n{d(\bar s)}\right)\right\}$.

Note that the boundedness of $h$ implies that values of $d(\bar s)\ge n$ lead to a trivial bound so
that the infimum could be reduced to those $\bar s$ such that $d(\bar s)<n$. We do not know to what
extent the $\log^3$ factor is necessary. We believe that it is not optimal although a log-factor
appears to be necessary in some situations as shown by the example of Section 5.1 below.
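
One way to see how a bound in expectation can follow from the deviation bound, possibly with a different universal constant, is the following integration argument. This is only a sketch: the symbols $B$ and $Z$ are introduced here for illustration, and the constant in the deviation bound is written $C_1$ to keep track of the change of constant.

```latex
% B := \inf_{\bar s\in A}\{ h^2(s,\bar s) + (d(\bar s)/n)\log_+^3(n/d(\bar s)) \},
% Z := C_1 h^2(s,\hat s) - B, so that P_s[Z > \xi/n] \le e^{-\xi} for all \xi > 0.
\mathbb E_s[Z] \;\le\; \int_0^{+\infty}\mathbb P_s[Z>z]\,dz
  \;=\; \frac1n\int_0^{+\infty}\mathbb P_s\!\left[Z>\frac\xi n\right]d\xi
  \;\le\; \frac1n\int_0^{+\infty}e^{-\xi}\,d\xi \;=\; \frac1n .
% Since d(\bar s)\ge 1 and \log_+\ge 1, one has 1/n \le B, hence
C_1\,\mathbb E_s\!\left[h^2(s,\hat s)\right] \;\le\; B+\frac1n \;\le\; 2B ,
% which is the stated bound with C = C_1/2.
```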


5. Applications 


Throughout this section, $\mathscr X=\mathbb R$ is endowed with its Borel σ-algebra and $\mu$ is the Lebesgue measure on $\mathbb R$. In
particular, we shall only consider densities with respect to the Lebesgue measure. We start
with the following useful lemma:

Lemma 1. If $\mathscr C$ is a class of subsets of $\mathbb R$ such that each element of $\mathscr C$ is the union of at
most $k$ intervals, $\mathscr C$ is VC with dimension at most $2k$.


Proof. Let $x_1<x_2<\dots<x_{2k+1}$ be $2k+1$ points of $\mathbb R$. It is easy to check that elements
of the form $J_1\cup\dots\cup J_l\in\mathscr C$, where the $J_j$ are disjoint intervals and $l\le k$, cannot pick
up the subset of points $\bigcup_{i=0}^{k}\{x_{2i+1}\}$. □
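
The counting argument behind Lemma 1 can be checked by brute force. The sketch below assumes the largest possible class, namely all unions of at most $k$ intervals, and relies on the following observation: among sorted points, a subset can be picked by such a union exactly when its selected points form at most $k$ blocks of consecutive points. The function names are ours.

```python
from itertools import product
from typing import Sequence


def count_runs(mask: Sequence[bool]) -> int:
    """Number of maximal blocks of consecutive selected points among sorted points."""
    runs = 0
    previous = False
    for selected in mask:
        if selected and not previous:
            runs += 1
        previous = selected
    return runs


def pickable(mask: Sequence[bool], k: int) -> bool:
    """True if the selected points can be cut out by a union of at most k intervals."""
    return count_runs(mask) <= k


def shattered(m: int, k: int) -> bool:
    """True if m sorted points are shattered by the unions of at most k intervals."""
    return all(pickable(mask, k) for mask in product([False, True], repeat=m))


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, shattered(2 * k, k), shattered(2 * k + 1, k))
```

The alternating subset $\{x_1,x_3,\dots,x_{2k+1}\}$ consists of $k+1$ blocks, which is why $2k+1$ points can never be shattered.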


5.1. Piecewise constant densities. Let us now consider the model $V(D)$ of Section 3
to build a ρ-estimator. If $f$ and $\bar f$ belong to $V(D)$, for all $\lambda\ge0$, $f-\lambda\bar f$ is of the form
$\sum_{j=1}^{k}a_j\mathbb 1_{I_j}$ with $k\le2(D+1)$ so that $\{x\in\mathbb R\,|\,f(x)-\lambda\bar f(x)>0\}$ is the union
of at most $D+1$ disjoint intervals. Applying Lemma 1 and Proposition 2 to the sets
$\mathscr C=\mathscr C(V(D),\bar f,\lambda)$ with $\lambda\ge0$, we obtain that all the elements of $V(D)$ are extremal in
$V(D)$ and their degrees are not larger than $4(D+1)$. We therefore deduce from Theorem 2
that

(12)    $\sup_{s\in V(D)}\mathbb E_s\left[h^2(s,\hat s)\right]\le C\frac Dn\log_+^3\left(\frac nD\right)$,

which, up to the logarithmic factor, corresponds to a parametric rate (with respect to $n$)
although the partition that defines $s$ can be arbitrary in $\mathcal J(D+2)$ and the support of $s$
is unknown. It follows from Birgé and Massart (1998), Proposition 2, that a lower bound
for the minimax risk on $V(D)$ is of the form $C'(D/n)\log_+(n/D)$, which shows that some
power of $\log_+(n/D)$ is necessary in (12). We suspect that the power three for the logarithm
is not optimal.

5.2. Piecewise monotone densities. Let us now see how Theorem 2 can be applied in
the simple situation of piecewise monotone densities.

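Definition 7. Given $k\in\mathbb N^*$ and a partition $\mathcal I=\{I_0,\dots,I_{k-1}\}\in\mathcal J(k)$, a real-valued
function $f$ on $\mathbb R$ will be called piecewise monotone (with $k$ pieces) based on $\mathcal I$ if $f$ is monotone
on each open interval $I_j$, $0\le j\le k-1$. The set of all such functions will be denoted by
$\mathscr F_k$. For $k\ge2$ (since no density is monotone on $\mathbb R$), $\overline{\mathscr F}_k$ is the set of densities (with respect
to the Lebesgue measure) that belong to $\mathscr F_k$.

Clearly, $\mathscr F_k\subset\mathscr F_l$ and $\overline{\mathscr F}_k\subset\overline{\mathscr F}_l$ for all $l\ge k$.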

Proposition 3. For $D\ge1$ and $k\ge2$, any element $\bar f$ of $\overline{\mathscr F}_k\cap V(D)$ is extremal in $\overline{\mathscr F}_k$
with degree not larger than $3(k+D+1)$.

Proof. Let $f$ be a piecewise monotone density on $\mathbb R$ based on a partition $\mathcal I_0\in\mathcal J(k)$, therefore
with $k-1$ endpoints. Let $\bar f\in V(D)\cap\overline{\mathscr F}_k$ be a piecewise constant density based on a partition
$\mathcal I_1\in\mathcal J(D+2)$ (with $D+1$ endpoints) and let $\mathcal I_2=\mathcal I_1\vee\mathcal I_0$. It is a partition of $\mathbb R$ with at
most $k+D$ endpoints, therefore at most $k+D+1$ intervals and on each such interval $f$ is
monotone and $\bar f$ is constant which implies that $f-\lambda\bar f$ belongs to $\mathscr F_{k+D+1}$ for all $\lambda\ge0$. It
then follows from Lemma 2 below that the sets $\{x\in\mathbb R\,|\,f(x)-\lambda\bar f(x)>0\}$ are unions of at
most $(3/2)(k+D+1)$ intervals. The conclusion follows from Lemma 1 and Proposition 2
applied to $\mathscr C(\overline{\mathscr F}_k,\bar f,\lambda)$ with $\lambda\ge0$. □

Lemma 2. If $f\in\mathscr F_k$, whatever $a\in\mathbb R$ the set $\{x\in\mathbb R\,|\,f(x)>a\}$ can be written as a
union of at most $k+\lceil(k-1)/2\rceil\le3k/2$ intervals and the set $\{x\in\mathbb R\,|\,f(x)<a\}$ as well.

Proof. Let $f\in\mathscr F_k$, $\mathcal I$ be the partition of $\mathbb R$ with $k$ open intervals associated to $f$ and
$x_1,\dots,x_{k-1}$ the $k-1$ endpoints of this partition. For $I_j\in\mathcal I$, $\{x\in\mathbb R\,|\,f(x)>a\}\cap I_j$ is
either $\varnothing$ or a non-void interval and

(13)    $\{f>a\}=\left(\bigcup_{j=0}^{k-1}\left[\{f>a\}\cap I_j\right]\right)\cup\left(\bigcup_{j=1}^{k-1}\left[\{x_j\}\cap\{f>a\}\right]\right)$.

This decomposition shows that $\{x\in\mathbb R\,|\,f(x)>a\}$ is the union of at most $2k-1$ disjoint
intervals. Nevertheless, this bound can be refined as follows. If $f(x_j)\le a$, $\{x_j\}\cap\{f>a\}=\varnothing$
and the only situation we need to consider is when $f(x_j)>a$ in which case
$\{x_j\}\cap\{f>a\}=\{x_j\}$. If $x_j$ belongs to the closure of one of the intervals of the form
$\{f>a\}\cap I_{j'}$, the set $\left[\{f>a\}\cap I_{j'}\right]\cup\{x_j\}$ only counts for one interval in (13). The only
situation for which $\{x_j\}$ adds an extra interval occurs when $f(x_{j-1}+)>a$, $f(x_j-)<a$,
$f(x_j)>a$, $f(x_j+)<a$ and $f(x_{j+1}-)>a$. The number of such points $x_j$ is not larger
than $\lceil(k-1)/2\rceil$ and $\{x\in\mathbb R\,|\,f(x)>a\}$ is therefore the union of at most $k+\lceil(k-1)/2\rceil$
intervals. The proof for $\{x\in\mathbb R\,|\,f(x)<a\}$ is the same. □



An application of Theorem 2 with $A=\bigcup_{D\ge1}\left[\overline{\mathscr F}_k\cap V(D)\right]$, the elements of which are
extremal in $\overline{\mathscr F}_k$ by Proposition 3, leads to the following result.

Corollary 2. For all $k\ge2$, any ρ-estimator $\hat s$ on $\overline{\mathscr F}_k$ satisfies, for all distributions $P_s\in\mathscr P_\mu$,

$C\,\mathbb E_s\left[h^2(s,\hat s)\right]\le\inf_{D\ge1}\left[\inf_{\bar s\in\overline{\mathscr F}_k\cap V(D)}h^2(s,\bar s)+\frac{k+D}n\log_+^3\left(\frac n{k+D}\right)\right]$,

where $C\in(0,1]$ is a universal constant.

Note that the bound is trivial for $k\ge n-1$ and that using $D$ with $k+D\ge n$ also leads
to a trivial bound so that we should restrict ourselves to $D<n-k$ when $k<n-1$.

Since, for $t\in\overline{\mathscr F}_k$,

$\inf_{\bar s\in\overline{\mathscr F}_k\cap V(D)}h(s,\bar s)\le h(s,t)+\inf_{\bar s\in\overline{\mathscr F}_k\cap V(D)}h(t,\bar s)$,

to go further with our analysis it will be necessary to evaluate $\inf_{\bar s\in\overline{\mathscr F}_k\cap V(D)}h(t,\bar s)$ for
$t\in\overline{\mathscr F}_k$. In order to do this we shall use an approximation result based on the following
functional $M_k$.


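Definition 8. Let $t\in\overline{\mathscr F}_k$ for $k\ge2$ and $\mathcal I=\{I_0,\dots,I_{k-1}\}\in\mathcal J(k)$ a partition on which $t$
is based. Using the convention $(+\infty)\times0=0$, we define

(14)    $M_k(t,\mathcal I)=\left[\sum_{j=0}^{k-1}\left(\ell(I_j)V^2_{I_j}\left(\sqrt t\right)\right)^{1/3}\right]^3\le+\infty$,

where $V_{I_j}\left(\sqrt t\right)$ is the variation of $\sqrt t$ on $I_j$ given by (8). The functional $M_k$ is defined on
$\overline{\mathscr F}_k$ as

$M_k(t)=\inf_{\mathcal I}M_k(t,\mathcal I)$ for all $t\in\overline{\mathscr F}_k$,

where the infimum runs among all partitions $\mathcal I\in\mathcal J(k)$ on which $t$ is based. For $0\le M<
+\infty$ and $k\in\mathbb N^*$, we denote by $\overline{\mathscr F}_{k+2}(M)$ the subset of $\overline{\mathscr F}_{k+2}$ of those densities $t$ such that
$M_{k+2}(t)\le M$.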


Note that with our convention, if $M_k(t,\mathcal I)<+\infty$, $t$ is equal to zero on $I_0\cup I_{k-1}$, in which
case the summation in (14) can be restricted to $1\le j\le k-2$, which requires that $k>2$,
and that $\overline{\mathscr F}_{k+2}(0)$ is equal to $V(k)$ (in the $L_1$ sense). The functional $M_k$ is translation and
scale invariant which means that it takes the same value at $t$ and $\lambda^{-1}t((\cdot-\tau)/\lambda)$ whatever
$\lambda>0$ and $\tau\in\mathbb R$. Besides, it possesses the following property.

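Lemma 3. For all $l \ge k$ and $t \in \mathcal{F}_k \subset \mathcal{F}_l$, $M_l(t) \le M_k(t)$.

Proof. Let $\mathcal{I} \in \mathcal{J}(k)$ be a partition on which $t$ is based. For all partitions $\mathcal{I}' \in \mathcal{J}(l)$ which are finer than $\mathcal{I}$, $t$ can be viewed as an element of $\mathcal{F}_l$ based on $\mathcal{I}'$ and consequently it suffices to show that $M_l(t,\mathcal{I}') \le M_k(t,\mathcal{I})$. In fact, it suffices to show that, when we simply divide an interval $J$ of length $L$ of $\mathcal{I}$ into $m$ intervals $J_1,\dots,J_m$ of respective lengths $L_1,\dots,L_m$,

\[ \sum_{j=1}^m \left[L_j V_{J_j}^2\left(\sqrt{t}\right)\right]^{1/3} \le \left[L\,V_J^2\left(\sqrt{t}\right)\right]^{1/3} \quad\text{when}\quad \sum_{j=1}^m L_j = L \quad\text{and}\quad \sum_{j=1}^m V_{J_j}\left(\sqrt{t}\right) \le V_J\left(\sqrt{t}\right). \]

Setting $L_j = \alpha_j L$ and $V_{J_j}(\sqrt{t}) = \beta_j V_J(\sqrt{t})$, this amounts to showing that $\sum_{j=1}^m \alpha_j^{1/3}\beta_j^{2/3} \le 1$ when $\sum_{j=1}^m \alpha_j = 1$ and $\sum_{j=1}^m \beta_j \le 1$, which follows from Hölder's Inequality. □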
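The Hölder step can be illustrated numerically. The following sketch draws random weights $\alpha_j$ summing to one and $\beta_j$ summing to at most one (the sampling scheme and the function names are our own assumptions) and records the largest value of $\sum_j \alpha_j^{1/3}\beta_j^{2/3}$ that is observed.

```python
import random

def holder_sum(alphas, betas):
    """Return sum_j alpha_j^(1/3) * beta_j^(2/3)."""
    return sum(a ** (1 / 3) * b ** (2 / 3) for a, b in zip(alphas, betas))

def random_weights(m, rng, total=1.0):
    """Draw m positive weights summing to `total`."""
    raw = [rng.random() + 1e-12 for _ in range(m)]
    s = sum(raw)
    return [total * r / s for r in raw]

rng = random.Random(0)
worst = 0.0
for _ in range(10_000):
    m = rng.randint(1, 8)
    alphas = random_weights(m, rng)                        # sum of alpha_j equals 1
    betas = random_weights(m, rng, total=rng.random())     # sum of beta_j at most 1
    worst = max(worst, holder_sum(alphas, betas))

# Equality case: alpha_j = beta_j for all j.
equality_case = holder_sum([0.5, 0.5], [0.5, 0.5])
```

The value one is attained when $\alpha_j = \beta_j$ for all $j$, which corresponds to a subdivision that splits the length and the variation of $\sqrt{t}$ in the same proportions.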


The approximation of elements of $\mathcal{F}_{k+2}(M)$ by elements of $V(D)$ is controlled in the following way.

Proposition 4. Let $k \ge 1$ and $t \in \mathcal{F}_{k+2}$ with $M_{k+2}(t) < +\infty$. Then, for all $D \ge 1$,

\[ h^2\left(t, V(D+k)\cap\mathcal{F}_{k+2}\right) \le (2D)^{-2}M_{k+2}(t). \]

Applying Corollary 2 leads to the following bound, which is valid for all $t \in \mathcal{F}_{k+2}$ with $M_{k+2}(t) < +\infty$ and whatever the distribution $P_s$ of the observations:

(15) \[ C\,\mathbb{E}_s\left[h^2(s,\hat s)\right] \le h^2(s,t) + \inf_{D\ge 1}\left[\frac{M_{k+2}(t)}{4D^2} + \frac{k+D}{n}\log_+^3\left(\frac{n}{k+D}\right)\right]. \]

A final optimization with respect to $D$ leads to

\[ C\,\mathbb{E}_s\left[h^2(s,\hat s)\right] \le h^2(s,t) + \left[M_{k+2}(t)\right]^{1/3} n^{-2/3}(\log n)^2 + kn^{-1}(\log n)^3. \]
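The optimization with respect to $D$ can be explored numerically. The sketch below is only an illustration: it assumes the convention $\log_+(x) = \max(\log x, 1)$, uses hypothetical values of $M_{k+2}(t)$, $k$ and $n$, and minimizes the bracketed term of (15) by brute force over the integers $1 \le D < n-k$ before comparing it with the rate $M^{1/3}n^{-2/3}(\log n)^2 + kn^{-1}(\log n)^3$.

```python
import math

def log_plus(x):
    """Assumed convention: log_+(x) = max(log x, 1)."""
    return max(math.log(x), 1.0)

def bracket(D, M, k, n):
    """Bracketed term of (15): M / (4 D^2) + ((k + D) / n) * log_+^3(n / (k + D))."""
    return M / (4.0 * D ** 2) + (k + D) / n * log_plus(n / (k + D)) ** 3

def optimize_D(M, k, n):
    """Brute-force minimization over the integers 1 <= D < n - k."""
    best_D = min(range(1, n - k), key=lambda D: bracket(D, M, k, n))
    return best_D, bracket(best_D, M, k, n)

def claimed_rate(M, k, n):
    """M^(1/3) n^(-2/3) (log n)^2 + k n^(-1) (log n)^3."""
    log_n = math.log(n)
    return M ** (1 / 3) * n ** (-2 / 3) * log_n ** 2 + k / n * log_n ** 3

M, k, n = 8.0, 2, 100_000   # hypothetical values
best_D, best_value = optimize_D(M, k, n)
rate = claimed_rate(M, k, n)
```

Balancing the two terms of the bracket suggests a choice of $D$ of order $(Mn)^{1/3}/\log n$, which is how the exponent $1/3$ and the factor $(\log n)^2$ arise.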

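Since this result is valid for all densities $t \in \mathcal{F}_{k+2}$, we can again optimize it with respect to $t$, which finally leads to:

Theorem 3. Any ρ-estimator $\hat s$ based on the model $\mathcal{F}_{k+2}$ for some $k \ge 1$ satisfies

\[ C\,\mathbb{E}_s\left[h^2(s,\hat s)\right] \le \inf_{M>0}\left[h^2\left(s,\mathcal{F}_{k+2}(M)\right) + (\log n)^2\left(\left(Mn^{-2}\right)^{1/3} \vee \left(kn^{-1}\log n\right)\right)\right] \]

for all distributions $P_s \in \mathscr{P}_\mu$. In particular,

\[ \sup_{s\in\mathcal{F}_{k+2}(M)} \mathbb{E}_s\left[h^2(s,\hat s)\right] \le C(\log n)^2\left[\left(Mn^{-2}\right)^{1/3} \vee \left(kn^{-1}\log n\right)\right]. \]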


If we want to estimate a bounded unimodal density $s$ with support of finite length $L$, we may build a ρ-estimator $\hat s$ on $\mathcal{F}_4$. In such a case, $M_4(s)$ can be bounded by $4L\|s\|_\infty \ge 4$ (since $s$ is a density, $L\|s\|_\infty \ge 1$) and the performance of the ρ-estimator for such a unimodal density $s$ will be given by

\[ \mathbb{E}_s\left[h^2(s,\hat s)\right] \le C\left(L\|s\|_\infty n^{-2}\right)^{1/3}(\log n)^2. \]

5.3. Piecewise concave-convex densities. In the previous sections we considered densities $t$ which were piecewise monotone or constant, which implied the same properties for $\sqrt{t}$, but it follows from Proposition 4 that it is actually the approximation properties of $\sqrt{t}$ that matter. This derives from the fact that the Hellinger distance is an $L_2$-distance between the square roots of the densities. When going to more sophisticated properties than monotonicity, it is no longer the same to state them for $t$ or for $\sqrt{t}$, which accounts for the slightly more complicated structure of this section.

Definition 9. Let $\mathcal{I} \in \mathcal{J}(k)$ be a partition with $k$ intervals. A function $f$ is piecewise convex-concave based on $\mathcal{I}$ if it is either convex or concave on each (open) interval $I_j$ of the partition. The set of all such functions when $\mathcal{I}$ varies in $\mathcal{J}(k)$ will be denoted by $\mathcal{G}^1_k$. For $D \in \mathbb{N}^*$ we denote by $W_1(D)$ the set of all functions $\gamma$ of the form $\gamma = \sum_{j=1}^{D}\gamma_j 1_{(x_j,x_{j+1}]}$ with $x_1 < x_2 < \dots < x_{D+1}$, where $\gamma_j$ is an affine function for all $j$. The sets $\mathcal{F}^1_k$ and $V_1(D)$ are the sets of those densities $t$ such that $\sqrt{t}$ belongs to $\mathcal{G}^1_k$ and $W_1(D)$ respectively.


We recall that if $f$ is either concave or convex on some open interval $I$, it is continuous on $I$ and admits on $I$ a right-hand derivative $f'$ which is monotone.

The following result will prove useful to find extremal points of $\mathcal{F}^1_k$.

Lemma 4. For all $k \in \mathbb{N}^*$, $\mathcal{G}^1_k \subset \mathcal{G}_{2k}$.

Proof. Since $f \in \mathcal{G}^1_k$ there exists $\mathcal{I} \in \mathcal{J}(k)$ such that $f$ is either convex or concave on each open interval $I_j$ of $\mathcal{I}$. The right-hand derivative $f'$ of $f$ on $I_j$ being monotone, the sets $\{x \in I_j,\ f'(x) < 0\}$ and $\{x \in I_j,\ f'(x) \ge 0\}$ are two disjoint subintervals of $I_j$ on which $f$ is monotone. □













Proposition 5. For all $D,k \in \mathbb{N}^*$, the elements of $\mathcal{F}^1_k \cap V_1(D)$ are extremal in $\mathcal{F}^1_k$ with degrees not larger than $12(D+k+1)$.

Proof. Let us consider $g \in \mathcal{G}^1_k$ and $\bar g \in \mathcal{G}^1_k \cap W_1(D)$. There exists a partition $\mathcal{I}_0$ with $k-1$ endpoints such that $g$ is either convex or concave on each interval of $\mathcal{I}_0$ and a partition $\mathcal{I}_1$ with $D+1$ endpoints such that $\bar g$ is affine on each interval of $\mathcal{I}_1$. The partition $\mathcal{I}_0 \vee \mathcal{I}_1$ contains at most $k+D+1$ intervals and on each of these intervals $g - \lambda\bar g$ is either convex or concave for all $\lambda \in \mathbb{R}_+$. Hence, the function $g - \lambda\bar g$ belongs to $\mathcal{G}^1_{k+D+1}$, which is a subset of $\mathcal{G}_{2(k+D+1)}$ by Lemma 4. It then follows from Lemma 2 that $\{g - \lambda\bar g > 0\}$ is the union of at most $3(k+D+1)$ intervals. Since $\lambda$ is arbitrary in $\mathbb{R}_+$ we conclude with Lemma 1 that $\mathscr{C}(\mathcal{G}^1_k,\bar g)$ is VC with dimension not larger than $6(D+k+1)$, which shows by Proposition 2 that the elements $\bar g \in \mathcal{G}^1_k \cap W_1(D)$ are extremal in $\mathcal{G}^1_k$ with degrees not larger than $12(D+k+1)$. The conclusion follows from an application of Lemma 5 of Section 7.1 with $\alpha = 1/2$. □


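We may now apply Theorem 2 with $\overline{\Lambda} = \bigcup_{D\ge 1}\left(\mathcal{F}^1_k \cap V_1(D)\right)$, which consists of extremal points of $\mathcal{F}^1_k$, and deduce the following risk bound from Proposition 5:

Corollary 3. For all $k \ge 2$, any ρ-estimator $\hat s$ on $\mathcal{F}^1_k$ satisfies, for all distributions $P_s \in \mathscr{P}_\mu$,

\[ C\,\mathbb{E}_s\left[h^2(s,\hat s)\right] \le \inf_{D\ge 1}\left[\inf_{\bar s\in V_1(D)\cap\mathcal{F}^1_k} h^2(s,\bar s) + \frac{k+D}{n}\log_+^3\left(\frac{n}{k+D}\right)\right], \]

where $C \in (0,1]$ is a universal constant.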


The control of the approximation term $\inf_{\bar s\in V_1(D)\cap\mathcal{F}^1_k} h(s,\bar s)$ is analogous to the one we derived in the previous section for $\inf_{\bar s\in\mathcal{F}_k\cap V(D)} h(t,\bar s)$ but is based on a new functional:


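Definition 10. Let $t \in \mathcal{F}^1_k$ and $\mathcal{I} = \{I_0,\dots,I_{k-1}\} \in \mathcal{J}(k)$ a partition on which $t$ is based, that is, $\sqrt{t}$ is either convex or concave with monotone right-hand derivative $(\sqrt{t})'$ on each $I_j$. Using the convention $(+\infty)\times 0 = 0$, we define

\[ M_{k,1}(t,\mathcal{I}) = \left[\sum_{j=0}^{k-1}\left(\left[\ell(I_j)\right]^3 V_{I_j}^2\left((\sqrt{t})'\right)\right)^{1/5}\right]^{5} \le +\infty, \]

where $V_{I_j}\left((\sqrt{t})'\right)$ is the variation of $(\sqrt{t})'$ on $I_j$. The functional $M_{k,1}$ is defined on $\mathcal{F}^1_k$ as

\[ M_{k,1}(t) = \inf_{\mathcal{I}} M_{k,1}(t,\mathcal{I}) \quad\text{for all } t \in \mathcal{F}^1_k, \]

where the infimum runs among all partitions $\mathcal{I} \in \mathcal{J}(k)$ on which $t$ is based. For $0 \le M < +\infty$ and $k \in \mathbb{N}^*$, we denote by $\mathcal{F}^1_{k+2}(M)$ the subset of $\mathcal{F}^1_{k+2}$ of those densities $t$ such that $M_{k+2,1}(t) \le M$.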


Note that if $M_{k,1}(t,\mathcal{I})$ is finite, the density $t$ is necessarily zero on the two extremal (unbounded) intervals of the partition $\mathcal{I}$ and therefore $k \ge 3$. An analogue of Lemma 3 holds for the functional $M_{k,1}$ with a similar proof, saying that if $l \ge k$ and $t \in \mathcal{F}^1_k \subset \mathcal{F}^1_l$, then $M_{l,1}(t) \le M_{k,1}(t)$. We omit the details.

Proposition 6. Let $k \ge 1$ and $t \in \mathcal{F}^1_{k+2}$ with $M_{k+2,1}(t) < +\infty$. Then, for $D \ge 1$,

\[ h^2\left(t, V_1(2(D+k))\cap\mathcal{F}^1_{k+2}\right) \le (2D)^{-4}M_{k+2,1}(t). \]

Now arguing as we did in the previous section we derive from Corollary 3 and Proposition 6 our concluding result.

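Theorem 4. Any ρ-estimator $\hat s$ based on the model $\mathcal{F}^1_{k+2}$ with $k \ge 1$ satisfies, for all $P_s \in \mathscr{P}_\mu$,

\[ C\,\mathbb{E}_s\left[h^2(s,\hat s)\right] \le \inf_{M>0}\left[h^2\left(s,\mathcal{F}^1_{k+2}(M)\right) + \left(\left(Mn^{-4}\right)^{1/5}(\log n)^{12/5}\right) \vee \left(kn^{-1}(\log n)^3\right)\right]. \]

If, in particular, $s \in \mathcal{F}^1_{k+2}(M)$, then

\[ \mathbb{E}_s\left[h^2(s,\hat s)\right] \le C\left[\left(\left(Mn^{-4}\right)^{1/5}(\log n)^{12/5}\right) \vee \left(kn^{-1}(\log n)^3\right)\right]. \]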


5.4. Log-concave densities. We now want to investigate a situation which is close to the previous one, the case of log-concave densities on the line. These are densities of the form $1_I\exp(g)$ for some open interval $I$ of $\mathbb{R}$, possibly of infinite length, and some concave function $g$ on $I$. Let us denote by $\mathcal{F}'$ the set of all such densities and by $V'(D)$ the subset of $\mathcal{F}'$ of those densities for which $g$ is piecewise affine on $I$ with $D$ pieces. For instance, the exponential density belongs to $V'(1)$ while the Laplace density belongs to $V'(2)$. Also note that if $1_I\exp(g)$ is log-concave, the same holds for its square root $1_I\exp(g/2)$.

Proposition 7. For all $D \in \mathbb{N}^*$, the elements of $V'(D)$ are extremal in $\mathcal{F}'$ with degrees not larger than $12(D+2)+4$.

Proof. Let us consider $1_I\exp(g) \in \mathcal{F}'$ and $1_J\exp(\bar g) \in V'(D)$. Then the set on which $1_I\exp(g) > \lambda 1_J\exp(\bar g)$, with $\lambda \ge 0$, is the subset of $I$ on which

\[ g > \log\lambda + \log 1_J + \bar g, \]

with the convention that $\log 0 = -\infty$. If $\lambda = 0$ it is the set $I$ itself. Otherwise it is equal to the union of $I \cap J^c$ and $I \cap J \cap \{g - \bar g > \log\lambda\}$. Since on the interval $I \cap J$, $g$ is concave and $\bar g$ piecewise affine with at most $D$ pieces, the function $h = (g - \bar g)1_{I\cap J} + (\log\lambda)1_{(I\cap J)^c}$ is piecewise concave on $\mathbb{R}$ with at most $D+2$ pieces. Hence $h$ belongs to $\mathcal{G}^1_{D+2}$ and by Lemma 4 it also belongs to $\mathcal{G}_{2(D+2)}$. It follows from Lemma 2 that $I \cap J \cap \{g - \bar g > \log\lambda\} = \{h > \log\lambda\}$ is the union of at most $3(D+2)$ intervals. Consequently, the set $\{1_I\exp(g) > \lambda 1_J\exp(\bar g)\}$ is the union of at most $3(D+2)+1$ intervals and we derive from Lemma 1 that the VC-dimension of $\mathscr{C}(\mathcal{F}', 1_J\exp(\bar g))$ is not larger than $6(D+2)+2$. The conclusion then follows from Proposition 2. □

We may now apply Theorem 2 with $\overline{\Lambda} = \bigcup_{D\ge 1}V'(D)$ and use Proposition 7 to derive the following risk bound.

Corollary 4. Any ρ-estimator $\hat s$ on $\mathcal{F}'$ satisfies, for all distributions $P_s \in \mathscr{P}_\mu$,

\[ C\,\mathbb{E}_s\left[h^2(s,\hat s)\right] \le \inf_{D\ge 1}\left[\inf_{\bar s\in V'(D)} h^2(s,\bar s) + \frac{D}{n}\log_+^3\left(\frac{n}{D}\right)\right], \]

where $C \in (0,1]$ is a universal constant.

In particular, if $s \in V'(D)$,

\[ C\,\mathbb{E}_s\left[h^2(s,\hat s)\right] \le \frac{D}{n}\log_+^3\left(\frac{n}{D}\right), \]

which means that the elements of $V'(D)$ can be estimated by the ρ-estimator at a parametric rate, up to some $(\log n)^3$ factor. This is the case for all uniform densities, for exponential densities and their translates and for the Laplace density among many others.

Remark. For simplicity we have restricted our study to log-concave densities but we could as well handle the case of piecewise log-concave densities with several pieces, that is, densities of the form $\sum_{j=1}^{k}1_{I_j}\exp(g_j)$ for disjoint intervals $I_j$ and concave functions $g_j$. The extension would be similar to that which leads from monotone to piecewise monotone and is straightforward.


6. Model selection 


All results of Sections 5.1 to 5.3 were based on the use of a single model: $V(D)$ in Section 5.1, $\mathcal{F}_{k+2}$ in Section 5.2 and $\mathcal{F}^1_{k+2}$ in Section 5.3, which implies that our risk bounds depend on $D$ in the first case and on $k$ in the other cases. In order to get the best possible value of either $D$ or $k$ for the unknown distribution $P_s$, we may use a selection procedure. There are different ways to do this but we shall explain how to do it using Theorem 9 of Birgé (2006, Section 9). To simplify the presentation, we assume that the number $n$ of observations is even with $n = 2p$ and we split the sample $\boldsymbol{X} = (X_1,\dots,X_n)$ into two parts of size $p$, $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$. We also consider all the models $\mathcal{F}_{j+2}$, $j \ge 1$, and $\mathcal{F}^1_{k+2}$, $k \ge 1$, simultaneously. For each of these models we fix a weight $\Delta(j) = j$ and $\Delta(k) = k$. It follows that

(16) \[ \sum_{j\ge 1}\exp[-\Delta(j)] + \sum_{k\ge 1}\exp[-\Delta(k)] = \frac{2}{e-1}. \]

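We may now use each of our models to build a ρ-estimator based on the sample $\boldsymbol{X}_1$. This results in a family of estimators $\hat s_j(\boldsymbol{X}_1)$, $j \ge 1$, and $\hat s^1_k(\boldsymbol{X}_1)$, $k \ge 1$. The risks of these estimators are bounded according to Theorems 3 and 4. In the second step, we consider these preliminary estimators, which are based on the sample $\boldsymbol{X}_1$ only, as a fixed set of points. We may apply to them the selection procedure described in Section 9.1 of Birgé (2006) via a T-estimator based on the second sample $\boldsymbol{X}_2$. Then Theorem 9 of that paper applies with the parameters $\Sigma = 2/(e-1)$, $\lambda = 1$, $q = 2$, $d = h$, $\kappa = 4$ and $a = p/4$. It follows that the selection procedure results in an estimator $\tilde s$ which satisfies

\[ C\,\mathbb{E}_s\left[h^2(s,\tilde s) \mid \boldsymbol{X}_1\right] \le \min\left\{\inf_{j\ge 1}\left[h^2\left(s,\hat s_j(\boldsymbol{X}_1)\right) + \frac{j}{p}\right],\ \inf_{k\ge 1}\left[h^2\left(s,\hat s^1_k(\boldsymbol{X}_1)\right) + \frac{k}{p}\right]\right\}. \]

We may then take the expectation with respect to $\boldsymbol{X}_1$ and get

\[ C\,\mathbb{E}_s\left[h^2(s,\tilde s)\right] \le \min\left\{\inf_{j\ge 1}\left[\mathbb{E}_s\left[h^2\left(s,\hat s_j(\boldsymbol{X}_1)\right)\right] + \frac{j}{p}\right],\ \inf_{k\ge 1}\left[\mathbb{E}_s\left[h^2\left(s,\hat s^1_k(\boldsymbol{X}_1)\right)\right] + \frac{k}{p}\right]\right\}. \]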

fc>i J 

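Now applying Theorems 3 and 4 in order to bound $\mathbb{E}_s\left[h^2(s,\hat s_j(\boldsymbol{X}_1))\right]$ and $\mathbb{E}_s\left[h^2(s,\hat s^1_k(\boldsymbol{X}_1))\right]$ respectively, we derive that the two following bounds hold simultaneously:

\[ C\,\mathbb{E}_s\left[h^2(s,\tilde s)\right] \le \inf_{j\ge 1,\,M>0}\left[h^2\left(s,\mathcal{F}_{j+2}(M)\right) + \left(\left(Mn^{-2}\right)^{1/3}(\log n)^2\right) \vee \left(jn^{-1}(\log n)^3\right)\right] \]

and

\[ C\,\mathbb{E}_s\left[h^2(s,\tilde s)\right] \le \inf_{k\ge 1,\,M>0}\left[h^2\left(s,\mathcal{F}^1_{k+2}(M)\right) + \left(\left(Mn^{-4}\right)^{1/5}(\log n)^{12/5}\right) \vee \left(kn^{-1}(\log n)^3\right)\right]. \]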


This is only a simple example and the same procedure could be applied to a larger family of models and preliminary estimators but we shall not insist on that here, the important point being that we may easily extend the results we got for a single model to large families of models and get a final bound corresponding to the best bound among all models involved in the procedure.

An alternative selection procedure leading to the same result is described in Baraud (2011, Section 6.2). It is also possible to avoid the splitting device by using all models simultaneously and a penalized ρ-estimator as indicated in Section 7 of Baraud et al. (2014). Again, we would get in the end the same type of risk bounds. For simplicity, we shall not insist on this other approach here.


7. Proofs 

7.1. Preliminaries. In the sequel, we shall use the following elementary properties. 

Lemma 5.

1) If $\mathscr{C}$ is a VC-class of subsets of $\mathscr{X}$ with dimension not larger than $d$ and $A \subset \mathscr{X}$, then the same holds for the class $\mathscr{C} \cap A$ defined by (10).

2) Let $\mathscr{G}$ be a class of real-valued functions on a set $\mathscr{X}$, $\bar g$ an extremal point of $\mathscr{G}$ with degree $d(\bar g)$ and $\phi(x) = x^{\alpha}$ for some positive $\alpha$. Let $\mathscr{F}$ be a class of non-negative functions on $\mathscr{X}$ such that

a) there exists $\bar f \in \mathscr{F}$ such that $\phi(\bar f) = \bar g$;

b) $\phi(f) \in \mathscr{G}$ for all $f \in \mathscr{F}$.

Then $\bar f$ is extremal in $\mathscr{F}$ with degree not larger than $d(\bar g)$.

3) Let $\mathscr{C}$ be a class of subsets of $\mathscr{X}$ and $A_1,\dots,A_k$ be a partition of $\mathscr{X}$. If for all $j \in \{1,\dots,k\}$, $\mathscr{C} \cap A_j$ is a VC-class with dimension not larger than $d_j$, then $\mathscr{C}$ is a VC-class with dimension not larger than $d = \sum_{j=1}^{k}d_j$.

Proof. Let $B \subset \mathscr{X}$ be a set with cardinality $d+1$. Either $B \subset A$ and $C \cap A \cap B = C \cap B$ for all $C \in \mathscr{C}$, so that $B$ cannot be shattered by $\mathscr{C} \cap A$, or $B \cap A^c$ is not empty and $B$ cannot be of the form $C \cap A \cap B$, which proves our first statement. The second statement follows from the fact that $\mathscr{C}(\mathscr{F},\bar f) = \mathscr{C}(\phi(\mathscr{F}),\bar g) \subset \mathscr{C}(\mathscr{G},\bar g)$. For the third one, we argue as follows: if $\mathscr{C}$ could shatter $d+1$ points, there would exist some $j \in \{1,\dots,k\}$ and $d_j+1$ points of $A_j$ that could be shattered by $\mathscr{C}$ and hence by $\mathscr{C} \cap A_j$. This would be contradictory with the fact that $\mathscr{C} \cap A_j$ is a VC-class with dimension not larger than $d_j$. □

7.2. Proof of Theorem 2. For $d \in \mathbb{N}^*$, let $\overline{\Lambda}(d) = \{\bar s \in \overline{\Lambda},\ d(\bar s) = d\}$. Since $\overline{S}$ is assumed to be separable, $\overline{\Lambda}(d) \subset \overline{S}$ is also separable and we may therefore choose a countable and dense subset $\Lambda(d)$ of $\overline{\Lambda}(d)$ for each $d \in \mathbb{N}^*$. Let us now choose a countable and dense subset $S$ of $\overline{S}$. Possibly changing $S$ into $S \cup \left(\bigcup_{d\ge 1}\Lambda(d)\right)$, we may assume with no loss of generality that $\Lambda(d) \subset S$ for all $d \in \mathbb{N}^*$. Finally, we define our estimator as (any) ρ-estimator $\hat s$ of $s$ based on $S$ following the construction described in Section 4.2 of Baraud et al. (2014) as well as the notations of this paper.

For $y \ge 1$ and $\bar s \in \Lambda = \bigcup_{d\ge 1}\Lambda(d)$, we set

\[ \mathscr{B}^S(s,\bar s,y) = \left\{t \in S \mid h^2(s,t) + h^2(s,\bar s) \le y^2/n\right\}. \]

Note that $\mathscr{B}^S(s,\bar s,y)$ may be empty. We start our proof with the following lemma.

Lemma 6. For all $y \ge 1$ and $\bar s \in \Lambda$,

\[ \mathscr{F}(S,\bar s,y) = \left\{\psi\left(\sqrt{t/\bar s}\right),\ t \in \mathscr{B}^S(s,\bar s,y)\right\} \]

is a weak VC-major class with dimension not larger than $d(\bar s)$.

Proof. Since $(S/\bar s)$ is weak VC-major with dimension not larger than $d(\bar s)$ and the map $x \mapsto \psi(\sqrt{x})$ is increasing from $[0,+\infty]$ to $[-1,1]$, it follows from Baraud (2016, Proposition 3) that

\[ \mathscr{F}'(S,\bar s,y) = \left\{\psi\left(\sqrt{t/\bar s}\right),\ t \in S\right\} \]

is weak VC-major with dimension not larger than $d(\bar s)$ and so is $\mathscr{F}(S,\bar s,y) \subset \mathscr{F}'(S,\bar s,y)$. □


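Let us now go on with the proof of Theorem 2. We fix $y \ge 1$, $\bar s \in \Lambda$ and $d = d(\bar s)$. It follows from Baraud (2011, Proposition 3 on page 386 with $\psi/\sqrt{2}$ in place of $\psi$) and the definition of $\mathscr{B}^S(s,\bar s,y)$ that, for all $t \in \mathscr{B}^S(s,\bar s,y)$,

\[ \mathbb{E}_s\left[\psi^2\left(\sqrt{t/\bar s}\,(X_1)\right)\right] \le \left[6\left(h^2(s,t) + h^2(s,\bar s)\right)\right] \wedge 1 \le \frac{6y^2}{n} \wedge 1. \]

Since $S$ is countable and $\psi$ is bounded by 1, the family $\mathscr{F}(S,\bar s,y)$ is also countable and its elements are bounded by 1. Besides, Lemma 6 ensures that $\mathscr{F}(S,\bar s,y)$ is a weak VC-major class with dimension not larger than $d \ge 1$. We may therefore apply Corollary 1 of Baraud (2016) to the family $\mathscr{F}(S,\bar s,y)$ with $b = 1$, $\sigma^2 = (6y^2/n) \wedge 1$ and get

\[ w^S(s,\bar s,y) = \mathbb{E}_s\left[\sup_{f\in\mathscr{F}(S,\bar s,y)}\left|\sum_{i=1}^{n}\left(f(X_i) - \mathbb{E}_s[f(X_i)]\right)\right|\right] \le 4\sqrt{2n\Gamma(d)}\,\sigma\log\left(\frac{e}{\sigma}\right) + 16\,\Gamma(d) \]
\[ \le 8\sqrt{3\Gamma(d)}\; y\log\left(e\left(\sqrt{\frac{n}{6y^2}} \vee 1\right)\right) + 16\,\Gamma(d), \]

with

\[ \Gamma(d) = \log\left(2\sum_{j=0}^{d\wedge n}\binom{n}{j}\right) \le \log 2 + (d\wedge n)\log\left(\frac{en}{d\wedge n}\right). \]

In particular, if $y^2 \ge \Gamma(d)/6 \ge (d\wedge n)/6$ then $\Gamma(d) \le y\sqrt{6\Gamma(d)}$, hence

(17) \[ w^S(s,\bar s,y) \le 8y\sqrt{3\Gamma(d)}\left[\log\left(\frac{en}{n\wedge d}\right) + 2\sqrt{2}\right] \quad\text{for } y \ge \sqrt{\Gamma(d)/6}. \]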
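The upper bound on $\Gamma(d)$ is the classical estimate $\sum_{j\le m}\binom{n}{j} \le (en/m)^m$ for $1 \le m \le n$. A minimal numerical check, with hypothetical function names and sample sizes of our own choosing, can be sketched as follows.

```python
import math

def gamma(d, n):
    """Gamma(d) = log(2 * sum_{j=0}^{min(d, n)} C(n, j))."""
    m = min(d, n)
    return math.log(2 * sum(math.comb(n, j) for j in range(m + 1)))

def gamma_upper(d, n):
    """log 2 + min(d, n) * log(e * n / min(d, n))."""
    m = min(d, n)
    return math.log(2) + m * math.log(math.e * n / m)

# Difference between the upper bound and Gamma(d) over a small grid of (n, d).
gaps = [gamma_upper(d, n) - gamma(d, n)
        for n in (1, 2, 5, 50, 400)
        for d in range(1, n + 10)]
```

Values of $d$ larger than $n$ are included on purpose: both quantities only depend on $d \wedge n$.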


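We recall that the quantity $D^S(s,\bar s)$ is defined in Section 4.3 of Baraud et al. (2014) by

\[ D^S(s,\bar s) = y_0^2 \vee 1 \quad\text{with}\quad y_0 = \sup\left\{y \ge 0 \mid w^S(s,\bar s,y) > c_0y^2\right\} \quad\text{and}\quad c_0 = \frac{\sqrt{2}-1}{2\sqrt{2}}. \]

It follows from (17) that $c_0y^2 < w^S(s,\bar s,y)$ implies that either $y < \sqrt{\Gamma(d)/6}$ or

\[ y < (c_0y)^{-1}w^S(s,\bar s,y) \le 8c_0^{-1}\sqrt{3\Gamma(d)}\left[\log\left(\frac{en}{n\wedge d}\right) + 2\sqrt{2}\right] = B. \]

Since in both cases $y^2 < \max\left\{\Gamma(d)/6;\ B^2\right\}$ and $d = d(\bar s)$, we deduce that

(18) \[ D^S(s,\bar s) \le \max\left\{\Gamma(d)/6;\ B^2\right\} \vee 1 \le \kappa\left[d(\bar s)\wedge n\right]\log^3\left(\frac{en}{d(\bar s)\wedge n}\right) \le \kappa\, d(\bar s)\log_+^3\left(\frac{n}{d(\bar s)}\right) \]

for all densities $s$, all $\bar s \in \Lambda$ and some positive numerical constant $\kappa$. We now use Theorem 1 in Baraud et al. (2014), for which we recall that $h^2(t,t')$ denotes the squared Hellinger distance between the distributions with densities $t$ and $t'$. Since $\Lambda \subset S$, we obtain from (18) that for all $\xi > 0$, with probability at least $1 - e^{-\xi}$,

(19) \[ C\,h^2(s,\hat s) \le \inf_{\bar s\in S}\left[h^2(s,\bar s) + \frac{D^S(s,\bar s)}{n}\right] + \frac{\xi}{n} \le \inf_{d\ge 1}\left[\inf_{\bar s\in\Lambda(d)}h^2(s,\bar s) + \kappa\frac{d}{n}\log_+^3\left(\frac{n}{d}\right)\right] + \frac{\xi}{n}. \]

Finally, $\Lambda(d)$ being dense in $\overline{\Lambda}(d)$,

\[ \inf_{\bar s\in\Lambda(d)}h^2(s,\bar s) = \inf_{\bar s\in\overline{\Lambda}(d)}h^2(s,\bar s) \quad\text{for all } d \in \mathbb{N}^* \]

and the bracketed term on the right-hand side of (19) becomes

\[ \inf_{d\ge 1}\left[\inf_{\bar s\in\overline{\Lambda}(d)}h^2(s,\bar s) + \kappa\frac{d}{n}\log_+^3\left(\frac{n}{d}\right)\right] = \inf_{\bar s\in\overline{\Lambda}}\left[h^2(s,\bar s) + \kappa\frac{d(\bar s)}{n}\log_+^3\left(\frac{n}{d(\bar s)}\right)\right]. \]

Our conclusion follows.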


7.3. Proof of Theorem 1. Let $D \ge 1$, $\lambda > 0$ and $\bar s \in V(D)$ be based on the partition $\mathcal{I} \in \mathcal{J}(D+2)$. For all $t = f1_{(x,+\infty)}$ and $I \in \mathcal{I}$, the positive part $(t - \lambda\bar s)_+$ of $t - \lambda\bar s$ is 0 on $(-\infty,x] \cap I$ and is non-increasing on $I \cap (x,+\infty)$. Consequently, $\{t - \lambda\bar s > 0\} \cap I = \{(t - \lambda\bar s)_+ > 0\} \cap I \cap (x,+\infty)$ is a sub-interval of $I$ (possibly empty) and $\mathscr{C}(S,\bar s,\lambda) \cap I$ is therefore VC with dimension not larger than 2. By Lemma 5, $\mathscr{C}(S,\bar s,\lambda)$ is VC with dimension not larger than $2(D+2)$ and by Proposition 2 the element $\bar s$ is extremal in $S$ with degree not larger than $4(D+2)$. Finally Theorem 1 follows from Theorem 2.

7.4. Proof of Proposition 1. It relies on a series of approximation lemmas that shall 
also prove useful in the sequel. 

Lemma 7. Let $f$ be a monotone function with finite variation $V_I(f)$ on some interval $I$ of finite length $\ell$. Then

\[ \int_I\left[f(x) - \bar f\,\right]^2dx \le \frac{\ell\,V_I^2(f)}{4} \quad\text{with}\quad \bar f = \frac{1}{\ell}\int_I f(x)\,dx \]

and the factor $1/4$ is optimal.

Proof. Assuming, without loss of generality, that $f$ is non-increasing, let us observe that one can replace $f$ by $g$ with $g(x) = f(x + c) - \bar f$, where $c$ is the left-hand point of $I$. This amounts to assuming that $c = 0 = \bar f$. Let $f(0+) = a$, $f(\ell-) = -b$, $a + b = V_I(f) = R$ and $\lambda = \ell^{-1}\sup\{x \mid f(x) \ge 0\} \in (0,1)$. Then

\[ \int_0^{\lambda\ell}f(x)\,dx = -\int_{\lambda\ell}^{\ell}f(x)\,dx = A\ell \le \ell\min\{a\lambda,\ b(1-\lambda)\} = \ell\min\{(R-b)\lambda,\ b(1-\lambda)\}. \]

A maximization with respect to $b$ and $\lambda$ shows that $A \le R/4$ and it follows that

\[ \int_0^{\ell}f^2(x)\,dx = \int_0^{\lambda\ell}f^2(x)\,dx + \int_{\lambda\ell}^{\ell}f^2(x)\,dx \le (a+b)A\ell = RA\ell \le \frac{R^2\ell}{4}. \]

The optimality follows by considering the case of $f = (R/2)\left(1_{(0,\ell/2]} - 1_{(\ell/2,\ell)}\right)$. □
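Lemma 7 can be illustrated on monotone step functions. The sketch below assumes that $f$ is piecewise constant on cells of equal width (a simplification of ours, as are the function names), so that the integral reduces to a finite sum, and it also evaluates the extremal two-valued function of the proof.

```python
import random

def l2_deviation(values, cell):
    """Integral of (f - mean)^2 for f piecewise constant on cells of equal width."""
    mean = sum(values) / len(values)
    return cell * sum((v - mean) ** 2 for v in values)

def lemma7_bound(values, cell):
    """l * V^2 / 4, with l the total length and V the variation of a monotone f."""
    length = cell * len(values)
    variation = abs(values[-1] - values[0])   # monotone: variation = |last - first|
    return length * variation ** 2 / 4.0

rng = random.Random(2)
ratios = []
for _ in range(2000):
    m = rng.randint(2, 40)
    values = sorted((rng.uniform(-5, 5) for _ in range(m)), reverse=True)
    cell = rng.uniform(0.01, 3.0)
    bound = lemma7_bound(values, cell)
    if bound > 0:
        ratios.append(l2_deviation(values, cell) / bound)

# Extremal case of the proof: f = R/2 on the first half, -R/2 on the second half.
R, cell = 3.0, 0.25
extremal = [R / 2] * 4 + [-R / 2] * 4
extremal_ratio = l2_deviation(extremal, cell) / lemma7_bound(extremal, cell)
```

In this discrete setting the statement reduces to the fact that the variance of a bounded variable is at most a quarter of the squared range.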

Our next lemma involves the norm in $L_2(\mathbb{R},\mathscr{B}(\mathbb{R}),dx)$, hereafter denoted by $\|\cdot\|$.

Lemma 8. Let $f$ be a non-increasing function on $(a,b)$ with finite variation $V_{(a,b)}(f) < R$. For all $D \ge 1$, there exists a partition $\mathcal{I}$ of $(a,b)$ into at most $D$ intervals and a function $f_{\mathcal{I}}$ which is constant on each element of the partition $\mathcal{I}$ and non-increasing such that $f(b-) \le f_{\mathcal{I}} \le f(a+)$, $\left\|f_{\mathcal{I}}1_{(a,b)}\right\| \le \left\|f1_{(a,b)}\right\|$,

(20) \[ \int_a^b f_{\mathcal{I}}(x)\,dx = \int_a^b f(x)\,dx, \qquad \left\|(f - f_{\mathcal{I}})1_{(a,b)}\right\| \le \frac{R\sqrt{b-a}}{2D}. \]

Besides, there exists a partition $\mathcal{I}'$ of $(a,b)$ into at most $2D$ intervals of length not larger than $(b-a)/D$ such that for all $I \in \mathcal{I}'$, $V_I(f) \le RD^{-1}$. The same results hold for non-decreasing functions on $(a,b)$.

Proof. Clearly, the results remain valid if we replace $f$ by $g$ with $g = f$ almost everywhere (with respect to the Lebesgue measure), $g(a+) = f(a+)$ and $g(b-) = f(b-)$. Since $f$ is non-increasing on $(a,b)$, for all $x \in (a,b)$ $f(x+)$ exists and $f$ admits an at most countable number of discontinuities. We may therefore assume that $f$ is actually defined on $[a,b]$, right-continuous on $[a,b)$ and left-continuous at $b$.

Starting from $x_0 = a$, define recursively for all $j \ge 1$,

\[ x_j = \sup\left\{x \in [x_{j-1},b],\ f(x_{j-1}) - f(x) \le RD^{-1}\right\}. \]

If $k \ge 1$ and $x_k < b$, $f(x_{k-1}) - f(x) > RD^{-1}$ for all $x > x_k$, hence $f(x_{k-1}) - f(x_k) \ge RD^{-1}$ since $f$ is right-continuous. In particular for such a $k$, we necessarily have

\[ R > f(a) - f(x_k) = \sum_{j=1}^{k}\left[f(x_{j-1}) - f(x_j)\right] \ge kRD^{-1}, \]

which implies that $k < D$. The process therefore results in a finite number of distinct points $x_0 = a < x_1 < \dots < x_{K+1} = b$ with $K + 1 \le D$. It also follows from the definition of the $x_j$ that $f(x_{j-1}) - f(x_j-) \le RD^{-1}$ for $1 \le j \le K+1$. Let us now set

\[ \bar f_j = (x_j - x_{j-1})^{-1}\int_{x_{j-1}}^{x_j}f(x)\,dx \quad\text{and}\quad f_{\mathcal{I}} = \sum_{j=1}^{K+1}\bar f_j1_{[x_{j-1},x_j)}. \]

Note that $f(b-) \le f_{\mathcal{I}} \le f(a+)$, $\int_a^b f_{\mathcal{I}}(x)\,dx = \int_a^b f(x)\,dx$ and that $f_{\mathcal{I}}$ is non-increasing and piecewise constant on a partition of $(a,b)$ into $K+1$ intervals. Since, for all $j$, $0 \le f(x_{j-1}) - f(x_j-) \le RD^{-1}$, it follows from Lemma 7 that

\[ \left\|(f - f_{\mathcal{I}})1_{(a,b)}\right\|^2 \le \sum_{j=1}^{K+1}(x_j - x_{j-1})\frac{R^2}{4D^2} = \frac{R^2(b-a)}{4D^2}. \]

Moreover Jensen's Inequality implies that

\[ (x_j - x_{j-1})\,\bar f_j^{\,2} \le \int_{x_{j-1}}^{x_j}f^2(x)\,dx, \]

which shows that $\left\|f_{\mathcal{I}}1_{(a,b)}\right\| \le \left\|f1_{(a,b)}\right\|$ and proves the first part of the lemma.

For the second part, define $\mathcal{I}'$ as follows: for each element $I \in \mathcal{I}$ with length $\ell(I)$ larger than $(b-a)/D$, divide $I$ into $\lceil D\ell(I)/(b-a)\rceil$ intervals of length not larger than $(b-a)/D$. The process results in a new partition $\mathcal{I}'$ finer than $\mathcal{I}$ and its cardinality is not larger than

\[ \sum_{I\in\mathcal{I}}\left\lceil\frac{D\ell(I)}{b-a}\right\rceil \le \sum_{I\in\mathcal{I}}\left[\frac{D\ell(I)}{b-a} + 1\right] \le \frac{D}{b-a}\sum_{I\in\mathcal{I}}\ell(I) + |\mathcal{I}| \le 2D. \]

Since by construction $V_I(f) \le RD^{-1}$ for all $I \in \mathcal{I}$, this property is also true for the elements $I$ of the partition $\mathcal{I}'$, which is finer than $\mathcal{I}$. For non-decreasing functions, change $f$ to $-f$. □
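The recursive construction of the points $x_j$ can be mimicked on a grid. The sketch below is a discrete analogue under assumptions of our own: $f$ is sampled on cells of equal width, a new block starts as soon as $f$ has dropped by more than $R/D$ since the start of the current block, and $f$ is replaced by its mean on each block. The test function $\exp(-3x)$ and all names are hypothetical.

```python
import math

def greedy_partition(values, R, D):
    """Cut a non-increasing sequence into blocks on which it drops by at most R / D.

    `values` are the values of f on consecutive cells of equal width; a new
    block starts as soon as f has decreased by more than R / D since the
    start of the current block (discrete analogue of the points x_j).
    """
    starts, anchor = [0], values[0]
    for i, v in enumerate(values):
        if anchor - v > R / D:
            starts.append(i)
            anchor = v
    return starts

def block_average(values, starts):
    """Replace f by its mean on each block (the piecewise constant approximation)."""
    out, bounds = [], starts + [len(values)]
    for lo, hi in zip(bounds, bounds[1:]):
        mean = sum(values[lo:hi]) / (hi - lo)
        out.extend([mean] * (hi - lo))
    return out

def l2_norm(values, cell):
    """L2 norm of a function that is constant on cells of width `cell`."""
    return math.sqrt(cell * sum(v * v for v in values))

# Hypothetical example: f(x) = exp(-3x) sampled at cell midpoints of (0, 1).
a, b, N, D = 0.0, 1.0, 2000, 5
cell = (b - a) / N
f = [math.exp(-3.0 * (a + (i + 0.5) * cell)) for i in range(N)]
R = 1.001 * (f[0] - f[-1])            # any R above the variation of f
starts = greedy_partition(f, R, D)
f_I = block_average(f, starts)
error = l2_norm([u - v for u, v in zip(f, f_I)], cell)
bound = R * math.sqrt(b - a) / (2 * D)
```

The blocks are driven by the variation of $f$ rather than by their lengths, which is why a second, finer partition is needed in the lemma when intervals of length at most $(b-a)/D$ are required.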


Lemma 9. Given two probability densities $t$, $u$ with respect to $\mu$,

(21) \[ h(t,u) \le \left\|\sqrt{t} - \lambda\sqrt{u}\right\| \quad\text{for all } \lambda \in \mathbb{R}. \]

In particular, if $f$ is a non-negative element in $L_2(\mu)$ such that $\|f\| > 0$ and $u = (f/\|f\|)^2$, $h(t,u) \le \left\|\sqrt{t} - f\right\|$ for any probability density $t$ with respect to $\mu$.

Proof. We notice that $\sqrt{t}$ and $\sqrt{u}$ are two vectors of norm one in $L_2(\mu)$ and their scalar product is $\int\sqrt{tu}\,d\mu = \cos\alpha$ for some $\alpha \in [0,\pi/2]$. It implies that $v = \cos\alpha\sqrt{u}$ is the orthogonal projection of $\sqrt{t}$ on the linear space generated by $\sqrt{u}$, hence

\[ \inf_{\lambda\in\mathbb{R}}\left\|\sqrt{t} - \lambda\sqrt{u}\right\| = \left\|\sqrt{t} - v\right\| = \sin\alpha. \]

Inequality (21) follows from the fact that $h^2(t,u) = 1 - \cos\alpha \le \sin^2\alpha$ for all $\alpha \in [0,\pi/2]$. The last result is obtained from (21) with $\lambda = \|f\|$. □
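The geometry behind Lemma 9 can be checked on discrete distributions. In the sketch below, densities are taken with respect to the counting measure on a finite set (an assumption made for simplicity, as are the function names), the Hellinger distance is computed from $h^2(t,u) = 1 - \sum\sqrt{tu}$, and the minimizing $\lambda$ is the cosine of the angle between $\sqrt{t}$ and $\sqrt{u}$.

```python
import math
import random

def normalize(weights):
    """Turn non-negative weights into a probability vector."""
    total = sum(weights)
    return [w / total for w in weights]

def hellinger(t, u):
    """h(t, u) with h^2 = 1 - sum sqrt(t u) (densities w.r.t. counting measure)."""
    affinity = sum(math.sqrt(a * b) for a, b in zip(t, u))
    return math.sqrt(max(0.0, 1.0 - affinity))

def l2_distance(t, u, lam):
    """|| sqrt(t) - lam * sqrt(u) || in L2 of the counting measure."""
    return math.sqrt(sum((math.sqrt(a) - lam * math.sqrt(b)) ** 2
                         for a, b in zip(t, u)))

rng = random.Random(3)
slack = []
for _ in range(2000):
    m = rng.randint(2, 10)
    t = normalize([rng.random() for _ in range(m)])
    u = normalize([rng.random() for _ in range(m)])
    cos_alpha = sum(math.sqrt(x * y) for x, y in zip(t, u))
    best = l2_distance(t, u, cos_alpha)           # minimum over lam, equal to sin(alpha)
    other = l2_distance(t, u, rng.uniform(-2, 3))
    slack.append((hellinger(t, u), best, other))
```

Each triple stores $h(t,u)$, the minimal distance $\sin\alpha$ and the distance for an arbitrary $\lambda$, which should appear in increasing order.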


To complete the proof of Proposition 1, we apply Lemma 8 with $f = \sqrt{t}$ and $R > \sqrt{t}(a+) - \sqrt{t}(b-)$. The resulting function $f_{\mathcal{I}}$ is then nonnegative, non-increasing on $(a,b)$ and satisfies $0 < \|f_{\mathcal{I}}\|$. Setting $s_{\mathcal{I}} = f_{\mathcal{I}}^2/\|f_{\mathcal{I}}\|^2$, which is an element of $V(D)$, we may apply the last part of Lemma 9 with $f = f_{\mathcal{I}}$, which gives $h(t,s_{\mathcal{I}}) \le \left\|\sqrt{t} - f_{\mathcal{I}}\right\| \le R\sqrt{b-a}/(2D)$. The conclusion follows by letting $R$ converge to $V_{[a,b]}(\sqrt{t})$.


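7.5. Proof of Proposition 4. Let $t$ be based on $\mathcal{I} = \{I_0,\dots,I_{k+1}\}$, $R_j > V_{I_j}(\sqrt{t})$ for $1 \le j \le k$ and let $D_1,\dots,D_k$ be positive integers. On the intervals $I_0$ and $I_{k+1}$, $t$ is equal to 0 and, for all other intervals of $\mathcal{I}$, one can apply Lemma 8 to find an approximation $f_j$ of $\sqrt{t}\,1_{I_j}$ which is monotone, piecewise constant with $D_j$ pieces on $I_j$ and satisfies, according to (20), $\left\|(f_j - \sqrt{t})1_{I_j}\right\| \le R_j\sqrt{\ell(I_j)}/(2D_j)$. Therefore, if $f = \sum_{j=1}^{k}f_j1_{I_j}$ and $u = (f/\|f\|)^2$, we derive from Lemma 9 that

\[ h^2(t,u) \le \left\|f - \sqrt{t}\right\|^2 = \sum_{j=1}^{k}\left\|(f_j - \sqrt{t})1_{I_j}\right\|^2 \le \sum_{j=1}^{k}\frac{\ell(I_j)R_j^2}{4D_j^2} = M. \]

Moreover, we can always assume (modifying it on a negligible set if necessary) that $u$ belongs to $V(D') \cap \mathcal{F}_{k+2}$ with $D' = \sum_{j=1}^{k}D_j$. Given $D$, a formal minimization with respect to the $x_j > 0$ of $\sum_{j=1}^{k}\ell(I_j)R_j^2x_j^{-2}$ under the condition that $\sum_{j=1}^{k}x_j = D$ leads to

\[ x_j = \lambda\left(\ell(I_j)R_j^2\right)^{1/3} \quad\text{with}\quad \lambda\sum_{i=1}^{k}\left(\ell(I_i)R_i^2\right)^{1/3} = D, \]

so that $x_j^{-1} = D^{-1}\left(\ell(I_j)R_j^2\right)^{-1/3}\sum_{i=1}^{k}\left(\ell(I_i)R_i^2\right)^{1/3}$. Taking into account the fact that the $D_j$ should belong to $\mathbb{N}^*$, we finally set

\[ D_j = \left\lceil D\left(\ell(I_j)R_j^2\right)^{1/3}\left[\sum_{i=1}^{k}\left(\ell(I_i)R_i^2\right)^{1/3}\right]^{-1}\right\rceil, \]

which implies that $\sum_{j=1}^{k}D_j \le D + k$ and

\[ M \le \frac{1}{4D^2}\sum_{j=1}^{k}\ell(I_j)R_j^2\left(\ell(I_j)R_j^2\right)^{-2/3}\left[\sum_{i=1}^{k}\left(\ell(I_i)R_i^2\right)^{1/3}\right]^2 = \frac{1}{4D^2}\left[\sum_{j=1}^{k}\left(\ell(I_j)R_j^2\right)^{1/3}\right]^3. \]

The corresponding function $u$ belongs to $V(D+k) \cap \mathcal{F}_{k+2}$ so that

\[ h^2\left(t, V(D+k)\cap\mathcal{F}_{k+2}\right) \le \frac{1}{4D^2}\left[\sum_{j=1}^{k}\left(\ell(I_j)R_j^2\right)^{1/3}\right]^3. \]

The conclusion follows by letting each $R_j$ converge to $V_{I_j}(\sqrt{t})$.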
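The allocation of the numbers of pieces can be checked numerically. In the sketch below, $a_j$ stands for $\ell(I_j)R_j^2$ and is drawn at random; the function names and the sampling ranges are hypothetical. The code records, for each draw, the total number of pieces, the error term $\sum_j a_j/(4D_j^2)$ and the target $\left(\sum_j a_j^{1/3}\right)^3/(4D^2)$.

```python
import math
import random

def allocate(a, D):
    """D_j = ceil(D * a_j^(1/3) / sum_i a_i^(1/3)) with a_j = l(I_j) * R_j^2."""
    roots = [x ** (1 / 3) for x in a]
    total = sum(roots)
    return [math.ceil(D * r / total) for r in roots]

def error_term(a, pieces):
    """M = sum_j a_j / (4 * D_j^2)."""
    return sum(x / (4.0 * d ** 2) for x, d in zip(a, pieces))

def target(a, D):
    """(sum_j a_j^(1/3))^3 / (4 * D^2)."""
    return sum(x ** (1 / 3) for x in a) ** 3 / (4.0 * D ** 2)

rng = random.Random(4)
results = []
for _ in range(2000):
    k = rng.randint(1, 12)
    D = rng.randint(1, 60)
    a = [rng.uniform(0.01, 10.0) for _ in range(k)]
    pieces = allocate(a, D)
    results.append((k, D, sum(pieces), error_term(a, pieces), target(a, D)))
```

Rounding up costs at most one extra piece per interval, which explains the passage from $D$ to $D + k$ in the statement of Proposition 4.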


7.6. Proof of Proposition 6. It relies on the following approximation lemma.

Lemma 10. Let $f$ be a continuous and either convex or concave function on $[a,b]$ with right-hand derivative $f'$ on $(a,b)$ satisfying $V_{(a,b)}(f') < +\infty$. The affine function $g$ on $[a,b]$ defined by $g(a) = f(a)$ and $g(b) = f(b)$ satisfies

\[ \sup_{a\le x\le b}|f(x) - g(x)| \le \frac{(b-a)\,V_{(a,b)}(f')}{4}. \]

The factor $1/4$ is optimal.

Proof. Changing $f$ into $-f$, we may assume that $f$ is concave on $[a,b]$. In particular, $h(x) = f(x) - g(x) \ge 0$ for $x \in [a,b]$ and since $h$ is continuous on $[a,b]$ and satisfies $h(a) = h(b) = 0$, there exists some $c \in (a,b)$ such that

\[ \sup_{a\le x\le b}h(x) = h(c) = \int_a^c\left(f'(u) - \ell\right)du = \int_c^b\left(\ell - f'(u)\right)du \quad\text{with}\quad \ell = \frac{f(b) - f(a)}{b-a}. \]

The function $f'$ being non-increasing on $(a,b)$,

\[ h(c) \le \left[\left(f'(a+) - \ell\right)(c-a)\right] \wedge \left[\left(\ell - f'(b-)\right)(b-c)\right] \]

and consequently,

\[ h(c) \le \left(f'(a+) - \ell\right)(c-a)\frac{b-c}{b-a} + \left(\ell - f'(b-)\right)(b-c)\frac{c-a}{b-a} = \frac{(c-a)(b-c)}{b-a}\left[f'(a+) - \ell + \ell - f'(b-)\right] \]
\[ = (b-a)\,\frac{c-a}{b-a}\left[1 - \frac{c-a}{b-a}\right]V_{(a,b)}(f') \le \frac{(b-a)\,V_{(a,b)}(f')}{4}. \]

The constant $1/4$ cannot be improved since it is reached for $f(x) = 1 - |x|$ on $[-1,1]$. □
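Both the bound and its optimality can be illustrated on a grid. The sketch below evaluates the gap between $f$ and its chord for the extremal function $f(x) = 1 - |x|$ and for a smooth concave function; the grid size, the second example and the function names are assumptions of ours.

```python
def chord_gap(f, a, b, grid=20001):
    """Maximum over a grid of |f(x) - g(x)|, g being the chord of f between a and b."""
    fa, fb = f(a), f(b)
    worst = 0.0
    for i in range(grid):
        x = a + (b - a) * i / (grid - 1)
        g = fa + (fb - fa) * (x - a) / (b - a)
        worst = max(worst, abs(f(x) - g))
    return worst

def lemma10_bound(a, b, derivative_variation):
    """(b - a) * V(f') / 4."""
    return (b - a) * derivative_variation / 4.0

# Extremal case: f(x) = 1 - |x| on [-1, 1]; f' jumps from 1 to -1, so V(f') = 2.
tent_gap = chord_gap(lambda x: 1.0 - abs(x), -1.0, 1.0)
tent_bound = lemma10_bound(-1.0, 1.0, 2.0)

# A smooth concave case: f(x) = -x^2 on [0, 3]; f' goes from 0 to -6, so V(f') = 6.
parabola_gap = chord_gap(lambda x: -x * x, 0.0, 3.0)
parabola_bound = lemma10_bound(0.0, 3.0, 6.0)
```

For the tent function the gap equals the bound, while for the parabola the gap is only half of it, the variation of $f'$ being spread over the whole interval instead of concentrated at one point.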


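Let $f$ be the function of Lemma 10 and $R > V_{(a,b)}(f')$. By Lemma 8, one can partition $(a,b)$ into $K \le 2D$ intervals $J_j$, $1 \le j \le K$, of length not larger than $D^{-1}(b-a)$ with $V_{J_j}(f') \le RD^{-1}$. Using this partition to approximate $f$ by a piecewise affine function $g_K$ with $K$ pieces and applying Lemma 10, we derive that

\[ \sup_{a\le x\le b}|f(x) - g_K(x)| \le (1/4)RD^{-1}\left[(b-a)/D\right] = (R/4)(b-a)D^{-2}, \]

hence

\[ \int_a^b|f(x) - g_K(x)|^2\,dx \le (R/4)^2(b-a)^3D^{-4}. \]

Note that, by construction, $g_K$ is concave on $[a,b]$ if $f$ is and $g_K$ is convex in the opposite case. Since $\sqrt{t}$ satisfies the assumptions of Lemma 10 on each of the $k$ non-extremal intervals of the partition $\mathcal{I}$ that defines $t$ and is zero on the two extremal intervals, we may use the previous approximation method on each non-extremal interval with $f = \sqrt{t}$ to get an approximation $v$ of $\sqrt{t}$ with $D' = 2\sum_{j=1}^{k}D_j$ pieces and such that

\[ \left\|\sqrt{t} - v\right\|^2 \le \frac{1}{16}\sum_{j=1}^{k}\left[\ell(I_j)\right]^3R_j^2D_j^{-4} \quad\text{if}\quad R_j > V_{I_j}\left((\sqrt{t})'\right) \text{ for } 1 \le j \le k. \]

Renormalizing $v$ as in Lemma 9, we conclude that there exists $u$ which belongs to $V_1(D') \cap \mathcal{F}^1_{k+2}$ and

\[ h^2(t,u) \le M = \frac{1}{16}\sum_{j=1}^{k}\left[\ell(I_j)\right]^3R_j^2D_j^{-4}. \]

We now mimic the proof of Proposition 4 to optimize the $D_j$ and get

\[ D_j = \left\lceil D\left(\left[\ell(I_j)\right]^3R_j^2\right)^{1/5}\left[\sum_{i=1}^{k}\left(\left[\ell(I_i)\right]^3R_i^2\right)^{1/5}\right]^{-1}\right\rceil \]

so that finally $\sum_{j=1}^{k}D_j \le D + k$ and

\[ M \le \frac{1}{16D^4}\sum_{j=1}^{k}\left[\ell(I_j)\right]^3R_j^2\left(\left[\ell(I_j)\right]^3R_j^2\right)^{-4/5}\left[\sum_{i=1}^{k}\left(\left[\ell(I_i)\right]^3R_i^2\right)^{1/5}\right]^4 = \frac{1}{16D^4}\left[\sum_{j=1}^{k}\left(\left[\ell(I_j)\right]^3R_j^2\right)^{1/5}\right]^5. \]

The corresponding function $u$ belongs to $V_1(2(D+k)) \cap \mathcal{F}^1_{k+2}$ so that

\[ h^2\left(t, V_1(2(D+k))\cap\mathcal{F}^1_{k+2}\right) \le \frac{1}{16D^4}\left[\sum_{j=1}^{k}\left(\left[\ell(I_j)\right]^3R_j^2\right)^{1/5}\right]^5. \]

The conclusion follows by letting each $R_j$ converge to $V_{I_j}\left((\sqrt{t})'\right)$.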